Message from the Program Chairs



Data Warehousing and Knowledge Discovery technology is emerging as a key 
technology for enterprises that wish to improve their data analysis, decision support 
activities, and the automatic extraction of knowledge from data. 

The First International Conference on Data Warehousing and Knowledge Discovery
(DaWaK’99) sought to fill an important gap in both data warehousing and knowledge 
discovery. The conference focused on the logical and the physical design of data 
warehousing and knowledge discovery systems. The scope of the papers covers the 
most recent and relevant topics in the areas of data warehousing, multidimensional 
databases, OLAP, knowledge discovery and data & web mining, and time series 
databases. 

These proceedings present the research and experience papers selected for 
presentation at DaWaK’99, held in Florence, Italy. We received more than 88 papers
from over 22 countries, and the program committee finally selected 31 long papers and
9 short papers. The conference program also included two invited talks, namely,
“Dynamic Data Warehousing” by Dr. Umeshwar Dayal, HP Labs, U.S.A., and “On
Tractable Queries and Constraints” by Prof. G. Gottlob, Technical University of 
Vienna, Austria. 

We would like to thank the conference general chair (Prof. Yahiko Kambayashi), 
DEXA'99 workshop general chair (Prof. Roland Wagner) and the organising 
committee of the 10th International Conference on Database and Expert Systems 
Applications (DEXA'99) for their support and cooperation. Many thanks are
due to Ms Gabriela Wagner for providing a great deal of help and assistance. We are 
very indebted to all program committee members and the referees, who have 
reviewed the papers in a very careful and timely manner. We would also like to thank 
all the authors who submitted their papers to this conference. 

Finally, our thanks go out to all of you who attended the conference here in Florence. 
We hope you had a great week of fun, sight-seeing, and of course, excellent
technical discussions. 

Mukesh Mohania and A Min Tjoa 
Program Committee Chairs
Department of Electrical and Computer Engineering
Computer Science Division 
National Technical University of Athens 
Zographou 157 73, Athens, Greece 
{dth, timos}@dblab.ece.ntua.gr



Abstract. A data warehouse (DW) can be seen as a set of materialized 
views defined over remote base relations. When a query is posed, it is 
evaluated locally, using the materialized views, without accessing the 
original information sources. The DWs are dynamic entities that evolve 
continuously over time. As time passes, new queries need to be answered 
by them. Some of these queries can be answered using exclusively the 
materialized views. In general, though, new views need to be added to the
DW.

In this paper we investigate the problem of incrementally designing a 
DW when new queries need to be answered and extra space is allocated 
for view materialization. Based on an AND/OR dag representation of 
multiple queries, we model the problem as a state space search problem. 

We design incremental algorithms for selecting a set of new views to 
additionally materialize in the DW that fits in the extra space, allows a 
complete rewriting of the new queries over the materialized views and 
minimizes the combined new query evaluation and new view maintenance 
cost. 

1 Introduction 

Data warehouses store large volumes of data which are frequently used by com- 
panies for On-Line Analytical Processing (OLAP) and Decision Support Sys- 
tem (DSS) applications. Data warehousing is also an approach for integrating 
data from multiple, possibly very large, distributed, heterogeneous databases 
and other information sources. 

A Data Warehouse (DW) can be abstractly seen as a set of materialized 
views defined over a set of (remote) base relations. OLAP and DSS applications 
make heavy use of complex grouping/aggregation queries. In order to ensure 
high query performance, the queries are evaluated locally at the DW, using 
exclusively the materialized views, without accessing the original base relations. 

When the base relations change, the views materialized at the DW need to
be updated. Different maintenance policies (deferred or immediate) and main-
tenance strategies (incremental or rematerialization) can be applied.

* Research supported by the European Commission under the ESPRIT Program LTR 
project "DWQ: Foundations of Data Warehouse Quality".

Mukesh Mohania and A Min Tjoa (Eds.): DaWaK’99, LNCS 1676, pp. 1-10, 1999 
© Springer-Verlag Berlin Heidelberg 1999 
1.1 The problem: Dynamic Data Warehouse Design

DWs are dynamic entities that evolve continuously over time. As time passes, 
new queries need to be answered by them. Some of the new queries can be 
answered by the views already materialized in the DW. Other new queries, in 
order to be answered by the DW, necessitate the materialization of new views. 
In any case, in order for a query to be answerable by the DW, there must exist 
a complete rewriting [5] of it over the (old and new) materialized views. Such 
a rewriting can be exclusively over the old views, or exclusively over the new 
views, or partially over the new and partially over the old views. If new views 
need to be materialized, extra space needs to be allocated for materialization.

One way of dealing with this issue is to re-implement the DW from scratch
for the old and the new queries. This is the static approach to the DW de-
sign problem. Re-implementing the DW from scratch, though, has the following
disadvantages:

(a) Selecting the appropriate set of views for materialization that satisfies the 
conditions mentioned above is a long and complicated procedure [9, 10, 8]. 

(b) During the materialization of the views in the DW, some old view material-
izations are removed from the DW while new ones are added to it. Therefore,
the DW is no longer fully operational. Given the sizes of actual DWs and the
complexity of views that need to be computed, the load window required to 
make the DW operational may become unacceptably long. 

In this paper we address the Dynamic DW Design Problem: given a DW (a set 
of materialized views), a set of new queries to be answered by it and possibly 
some extra space allocated for materialization to the DW, select a set of new 
views to additionally materialize in the DW such that: 

1. The new materialized views fit in the extra allocated space. 

2. All the new queries can be answered using exclusively the materialized views 
(the old and the new). 

3. The combination of the cost of evaluating the new queries over the materi- 
alized views and the cost of maintaining the new views is minimal. 

1.2 Contribution and outline 

The main contributions of this paper are the following: 

- We set up a theoretical basis for incrementally designing a DW by formulating 
the dynamic DW design problem. The approach is applicable to a broad class 
of queries, including queries involving grouping and aggregation operations. 

- Using an AND/OR dag representation of multiple queries (multiquery
AND/OR dags) we model the dynamic DW design problem as a state space
search problem. States are multiquery AND/OR dags representing old and 
new materialized views and complete rewritings of all the new queries over 
the materialized views. Transitions are defined through state transformation 
rules. 

- We prove that the transformation rules are sound and complete. In this sense
a goal state (an optimal solution) can be obtained by applying transformation 
rules to an initial state. 
- We design algorithms for solving the problem that incrementally compute
the cost and the size of a state when moving from one state to another. 

- The approach can also be applied for statically designing a DW, by consid- 
ering that the set of views already materialized in the DW is empty. 

This paper is organized as follows. The next section contains related work. In Section
3, we provide basic definitions and state formally the DW design problem. In 
Section 4 the dynamic DW design problem is modeled as a state space search 
problem. Incremental algorithms are presented in Section 5. The final section 
contains concluding remarks. A more detailed presentation can be found in [11]. 

2 Related work 

We are not aware of any research work addressing the incremental design of a 
DW. Static design problems using views usually follow this pattern:
select a set of views to materialize in order to optimize the query evaluation 
cost, or the view maintenance cost or both, possibly in the presence of some 
constraints. 

Work reported in [2, 3] aims at optimizing the query evaluation cost: in [2] 
greedy algorithms are provided for queries represented as AND/OR graphs under 
a space constraint. A variation of this work aims at minimizing the total query
response time under the constraint of total view maintenance cost [3].

[6] and [4] aim at optimizing the view maintenance cost: In [6], given a ma- 
terialized SQL view, an exhaustive approach is presented as well as heuristics 
for selecting additional views that optimize the total view maintenance cost. [4] 
considers the same problem for select-join views and indexes together. 

The works [7, 12] aim at optimizing the combined query evaluation and view 
maintenance cost: [7] provides an A* algorithm in the case where views are seen 
as sets of pointer arrays under a space constraint. [12] considers the problem for 
materialized views but without space constraints. 

None of the previous approaches requires the queries to be answerable exclu- 
sively from the materialized views in a non-trivial manner. This requirement is 
taken into account in [9] where the problem of configuring a DW without space 
restrictions is addressed for a class of select-join queries. This work is extended 
in [10] in order to take into account space restrictions, multiquery optimization 
over the maintenance queries, and auxiliary views, and in [8] in order to deal 
with PSJ queries under space restrictions. 

3 Formal statement of the problem 

In this section we formally state the dynamic DW design problem after providing 
initial definitions. 

Definitions. We consider relational algebra queries and views extended with
additional operations, for instance grouping/aggregation operations. Let $\mathbf{R}$
be a set of base relations. The DW initially contains a set $\mathbf{V}_0$ of materialized
views defined over $\mathbf{R}$, called old views. A set $\mathbf{Q}$ of new queries defined over $\mathbf{R}$
needs to be answered by the DW. It can be the case that these queries can be
answered using exclusively the old views. In general though, in order to satisfy
this requirement, a set $\mathbf{V}$ of new materialized views needs to be added to the
DW. Since the queries in $\mathbf{Q}$ must be answerable by the new state of the DW, there
must exist a complete rewriting of every query in $\mathbf{Q}$ over the views in $\mathbf{V}_0 \cup \mathbf{V}$.
Let $Q \in \mathbf{Q}$. By $Q^{\mathbf{V}}$ we denote a complete rewriting of $Q$ over $\mathbf{V}_0 \cup \mathbf{V}$. This
notation is extended to sets of queries. Thus, we write $\mathbf{Q}^{\mathbf{V}}$ for a set containing
the queries in $\mathbf{Q}$, rewritten over $\mathbf{V}_0 \cup \mathbf{V}$.

Generic cost model. The evaluation cost of $\mathbf{Q}^{\mathbf{V}}$, denoted $E(\mathbf{Q}^{\mathbf{V}})$, is the
weighted sum of the cost of evaluating every query rewriting in $\mathbf{Q}^{\mathbf{V}}$.

In defining the maintenance cost of $\mathbf{V}$ one should take into consideration
that the maintenance cost of a view, after a change to the base relations, may
vary with the presence of other materialized views in the DW [6]. As in [6], we
model the changes to different base relations by a set of transaction types. A
transaction type specifies the base relations that change, and the type and size
of every change. The cost of propagating a transaction type is the cost incurred
by maintaining all the views in $\mathbf{V}$ that are affected by the changes specified by
the transaction type, in the presence of the views in $\mathbf{V} \cup \mathbf{V}_0$. The maintenance
cost of $\mathbf{V}$, denoted $M(\mathbf{V})$, is the weighted sum of the cost of propagating all the
transaction types to the materialized views in $\mathbf{V}$.

The operational cost of the new queries and views is
$T(\mathbf{Q}^{\mathbf{V}}, \mathbf{V}) = E(\mathbf{Q}^{\mathbf{V}}) + c\,M(\mathbf{V})$. The parameter $c$, $c > 0$, is set by the DW
designer and indicates the relative importance of the query evaluation vs. the
view maintenance cost.

The storage space needed for materializing the views in $\mathbf{V}$ is denoted $S(\mathbf{V})$.

Problem statement. We now state the dynamic DW design problem as follows.

Input

A set $\mathbf{V}_0$ of old views over a set $\mathbf{R}$ of base relations.

A set $\mathbf{Q}$ of new queries over $\mathbf{R}$.

Functions $E$ for the query evaluation cost, $M$ for the view maintenance cost
and $S$ for the materialized views space.

A constant $t$ indicating the extra space allocated for materialization.

A constant $c$.

Output

A set of new views $\mathbf{V}$ over $\mathbf{R}$ such that:

(a) $S(\mathbf{V}) \le t$.

(b) There is a set $\mathbf{Q}^{\mathbf{V}}$ of complete rewritings of the queries in $\mathbf{Q}$ over $\mathbf{V}_0 \cup \mathbf{V}$.

(c) $T(\mathbf{Q}^{\mathbf{V}}, \mathbf{V})$ is minimal.
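
As an illustration of conditions (a) and (c), the following minimal Python sketch picks, among a few candidate view sets, the one that fits in the extra space and has the lowest operational cost. It assumes that condition (b) already holds for every candidate and that the cost components E, M and S have been computed beforehand; the class `Candidate`, the helper names and all numbers are hypothetical and are not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate set V of new views with precomputed cost components."""
    views: frozenset    # names of the new views in V
    eval_cost: float    # E(Q^V), weighted query evaluation cost
    maint_cost: float   # M(V), weighted view maintenance cost
    size: float         # S(V), storage space needed by V


def operational_cost(cand, c):
    """T(Q^V, V) = E(Q^V) + c * M(V)."""
    return cand.eval_cost + c * cand.maint_cost


def best_feasible(candidates, c, t):
    """Return the cheapest candidate with S(V) <= t, or None if none fits."""
    feasible = [cand for cand in candidates if cand.size <= t]
    return min(feasible, key=lambda cand: operational_cost(cand, c), default=None)


# Hypothetical candidates; the figures are made up for illustration only.
CANDIDATES = [
    Candidate(frozenset({"V1"}), eval_cost=10.0, maint_cost=4.0, size=5.0),
    Candidate(frozenset({"Q1"}), eval_cost=6.0, maint_cost=9.0, size=8.0),
    Candidate(frozenset({"V1", "Q1"}), eval_cost=4.0, maint_cost=12.0, size=13.0),
]
```

With `c = 0.5` and `t = 10`, the third candidate is excluded by the space constraint and `{Q1}` wins; raising `c` to 2 makes maintenance dominate and `{V1}` becomes the better choice.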

4 The dynamic DW design as a state space search 
problem 

We model, in this section, the dynamic DW design problem as a state space 
search problem. 

4.1 Multiquery AND/OR dags 

A query dag for a query is a rooted directed acyclic graph that represents the 
query’s relational algebra expression. Query dags do not represent alternative 
equivalent relational algebra rewritings of the query definition over the base re-
lations (that is, alternative ways of evaluating the query). Alternative rewritings
can be represented compactly by using AND/OR dags [7]. A convenient repre-
sentation of query evaluations using AND/OR dags [6] is adopted in rule-based
optimizers [1]. This representation distinguishes between AND nodes and OR
nodes: a query AND/OR dag for a query $Q$ is a rooted bipartite AND/OR dag $\mathcal{G}_Q$.
The nodes are partitioned into AND nodes and OR nodes. AND nodes are called
operation nodes and are labeled by a relational algebra operator while OR nodes
are called equivalence nodes and are labeled by a relational algebra expression (a
view). The root node and the sink nodes of $\mathcal{G}_Q$ are equivalence nodes labeled by
the query $Q$ and the base relations respectively. In the following we may identify
equivalence nodes with their labels.

An AND/OR dag $\mathcal{G}'_Q$ is a subdag of an AND/OR dag $\mathcal{G}_Q$ if dag $\mathcal{G}'_Q$ is a
subdag of dag $\mathcal{G}_Q$, and for every AND (operation) node in $\mathcal{G}'_Q$, all its incoming
and outgoing edges in $\mathcal{G}_Q$ are also present in $\mathcal{G}'_Q$. An AND/OR dag is an AND
dag if no OR (equivalence) node has more than one outgoing edge.

Multiple queries and alternative ways of evaluation can be represented by 
multiquery AND/OR dags. A multiquery AND/OR dag for a set of queries is a
bipartite AND/OR dag, similar to a query AND/OR dag, except that it does 
not necessarily have a single root. Every query in the query set is represented 
by an equivalence node in the multiquery AND/OR dag. Equivalence nodes 
representing queries are called query nodes and their labeling expressions are 
preceded in the multiquery dag by a *. All the root nodes (and possibly some 
other equivalence nodes) are query nodes. 

Example 1. Consider the relations $Department(\underline{DeptID}, DeptName)$ ($D$ for short)
and $Employee(\underline{EmpID}, EmpName, Salary, DeptID)$ ($E$ for short). An under-
lined attribute denotes the primary key of the relation. Figure 1 shows a multi-
query AND/OR dag $\mathcal{G}$ for the queries $Q_1 = \sigma_{Salary > 1000}(E \bowtie D)$ and
$Q_2 = \langle DeptID, DeptName \rangle \, \mathcal{F} \, \langle AVG(Salary) \rangle \, (\sigma_{Salary > 1000}(E \bowtie D))$. Some
attribute names are abbreviated in the figure in a self-explanatory way. Notice
also that the query node labeled by $*Q_1$ is not a root node. Three alternative
rewritings over the base relations for query $Q_2$ are represented in $\mathcal{G}$. □

Fig. 1. A multiquery AND/OR dag for the queries $Q_1$ and $Q_2$
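
To make the structure concrete, here is a minimal Python sketch of a multiquery AND/OR dag and of the enumeration of the rewritings it represents. Since Figure 1 is not reproduced here, the dag below is an assumed reconstruction based on the text of the examples: the node names `ED` (the join of E and D), `V1` (the selection on E) and `A` (the aggregate over `V1`) and the operator labels are illustrative, not taken from the figure.

```python
from itertools import product

# Each equivalence (OR) node maps to its alternative operation (AND) nodes.
# An operation node is a pair (operator label, tuple of child equivalence nodes).
# Sink nodes (here the base relations E and D) have no alternatives.
DAG = {
    "E": [],
    "D": [],
    "ED": [("join", ("E", "D"))],
    "V1": [("select", ("E",))],
    "Q1": [("select", ("ED",)), ("join", ("V1", "D"))],
    "A": [("agg", ("V1",))],
    "Q2": [("agg", ("Q1",)), ("join", ("A", "D"))],
}


def rewritings(dag, node):
    """List the expressions for `node` over the sink nodes of `dag`.

    Simplification: alternatives are expanded as trees, which is adequate
    here because no equivalence node with several alternatives is shared
    inside a single rewriting.
    """
    operations = dag[node]
    if not operations:          # sink node: stands for itself
        return [node]
    result = []
    for label, children in operations:
        child_options = (rewritings(dag, child) for child in children)
        for combo in product(*child_options):
            result.append(f"{label}({', '.join(combo)})")
    return result
```

For this assumed dag, `rewritings(DAG, "Q2")` yields three expressions over E and D, in line with the three alternative rewritings mentioned in Example 1, and `rewritings(DAG, "Q1")` yields two.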
4.2 States

A multiquery AND/OR dag $\mathcal{G}$ determines, in our context, the views and the rewrit-
ings of the queries in $\mathbf{Q}$ over views that can be under consideration for solving
the dynamic DW design problem.

Definition 1. Given $\mathcal{G}$ and $\mathbf{V}_0$, a state s is an AND/OR subdag of $\mathcal{G}$, where
some equivalence nodes may be marked, such that:

(a) All the query nodes of $\mathcal{G}$ are present in s,

(b) All the equivalence nodes of $\mathcal{G}$ that are in $\mathbf{V}_0$ are present in s and these are
the only marked nodes in s,

(c) Only query nodes or marked nodes can be root nodes. □

Intuitively, sink nodes represent views materialized in the DW. Marked nodes 
represent the old views (already materialized in the initial DW) that can be used 
in the rewriting of a new query. Sink nodes that are not marked represent new 
materialized views. A query dag for a query Q in s is a connected AND subdag
of s rooted at query node Q in s whose sink nodes are among the sink nodes
of s. It represents a complete rewriting of Q over the materialized views (sink
nodes) of s. Since all the query nodes of $\mathcal{G}$ are present in s, there is at least one
query dag for a query Q in s, for every query Q in $\mathbf{Q}$. Therefore, a state provides
information for both: 

(a) new views to materialize in the DW, and 

(b) complete rewritings of each new query over the old and the new materialized
views.

Example 2. Figures 2-3 show different states for the multiquery graph of Figure 1
when $\mathbf{V}_0 = \{D\}$. The labels of the operation nodes are written symbolically in the
figures. Marked nodes are depicted by filled black circles. For instance, in Figure
3(a), the nodes $E$, $D$ and $V_1$, where $V_1 = \sigma_{Salary > 1000}(E)$, are materialized views.
Node $D$ is an old view and $E$ and $V_1$ are new views. Two alternative rewritings for
the query $Q_1$ are represented: the query definition $Q_1 = \sigma_{Salary > 1000}(E \bowtie D)$ and
a rewriting of $Q_1$ using the materialized view $V_1$, $Q_1 = V_1 \bowtie D$. For the query
$Q_2$ the rewriting $Q_2 = \langle DeptID \rangle \, \mathcal{F} \, \langle AVG(Salary) \rangle \, (V_1) \bowtie D$ is represented.

In the state of Figure 3(c) the only new materialized view is query $Q_1$. The fol-
lowing rewriting is represented: $Q_2 = \langle DeptID, DeptName \rangle \, \mathcal{F} \, \langle AVG(Salary) \rangle \, (Q_1)$.
Note that this state is not a connected graph. □

With every state s, a cost and a size are associated through the functions cost
and size respectively: $cost(s) = T(\mathbf{Q}^{\mathbf{V}}, \mathbf{V})$, while $size(s) = S(\mathbf{V})$.

4.3 Transitions 

In order to define transitions between states we introduce two state transfor- 
mation rules. The state transformations may modify the set of sink nodes of a 
state, and remove some edges from it. Therefore, they modify, in general, the 
set of new views to materialize in the DW and the rewritings of the new queries 
over the materialized views. 

State transformation rules. Consider a state s. A path from a query node Q
to a node V is called query free if there is no node in it other than Q and V that
is a query node.
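
The query-free condition can be checked with a simple traversal. The Python sketch below is an illustration under assumptions: the state is given as a plain successor map `succ` (hypothetical), operation nodes are collapsed so that edges run directly between equivalence nodes, and only paths with at least one edge are considered (the case where Q and V coincide is handled separately by rule R1).

```python
def has_query_free_path(succ, query_nodes, q, v):
    """Return True if some path from q to v has no query node strictly inside it."""
    stack, seen = [q], {q}
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt == v:
                return True          # endpoint reached; v itself may be a query node
            if nxt in seen or nxt in query_nodes:
                continue             # never walk through an intermediate query node
            seen.add(nxt)
            stack.append(nxt)
    return False


# Hypothetical state, with operation nodes collapsed into direct edges.
SUCC = {
    "Q2": ["Q1", "A"],
    "Q1": ["ED", "V1", "D"],
    "A": ["V1"],
    "ED": ["E", "D"],
    "V1": ["E"],
}
QUERY_NODES = {"Q1", "Q2"}
```

In this toy state there is a query free path from `Q2` to `V1` (through `A`), but none from `Q2` to `ED`, because every such path passes through the query node `Q1`.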
R1 Let Q be a query node and V be a non-sink equivalence node in s. Nodes
Q and V need not necessarily be distinct, but if they are distinct, V should
not be a query node. If

(a) there is a query free path from Q to V, or nodes Q and V coincide, and

(b) there is no path from V to a non-marked sink node that is not a base
relation,

then:

(a) Remove from s all the edges and the non-marked or non-query nodes
(except V) that are on a path from V, unless they are on a path from
a query node that does not contain V. (Thus, node V becomes a sink
node.)

(b) Remove from the resulting state all the edges and the non-marked or
non-query nodes that are on a path from Q, unless they are on a query
dag for Q in s that contains a query free path from Q to V, or they are
on a path from a query node that does not contain Q.

R2 Let Q be a query node and V be a distinct equivalence node in s that is a sink
or a query node and is not a base relation. (V can be a marked node.) If

(a) there is a query free path from Q to V, and

(b) there is a query dag for Q in s that does not contain a query free path
from Q to V,

then:

Remove all the edges and the non-marked or non-query nodes that are
on a path from Q, unless they are on a query dag for Q in s that contains
a query free path from Q to V, or they are on a path from a query node
that does not contain Q.

Example 3. Consider the state s of Figure 2(a). We apply in sequence state
transformation rules to s. Figure 2(b) shows the state resulting from the application
of R1 to query node $Q_1$ and to equivalence node $V_1$ of s. By applying R1 to nodes
$Q_2$ and $V_1$ of s, we obtain the state of Figure 3(a). The state of Figure 3(b) results
from the application of R2 to query nodes $Q_2$ and $Q_1$ of the state of Figure 2(b).
Figure 3(c) shows the state resulting from the application of R1 to nodes $Q_2$ and $Q_1$
of state s (Figure 2(a)). Query node $Q_1$ now represents a materialized view. □

Fig. 2. States

Fig. 3. States resulting from the application of the state transformation rules

The state transformation rules are sound in the sense that the application of a
state transformation rule to a state results in a state. Note that the soundness of
the state transformation rules entails that a transformation of a state preserves
the existence of a complete rewriting of all the new queries over the materialized
views.

We say that there is a transition $T(s, s')$ from state s to state s' if and only
if s' can be obtained by applying a state transformation rule to s.

4.4 The search space 

We define in this subsection the search space. We first provide initial definitions 
and show that the state transformation rules are complete. 

Definition 2. Given $\mathcal{G}$ and $\mathbf{V}_0$, the initial state $s_0$ is a state constructed as
follows. First, mark all the equivalence nodes of $\mathcal{G}$ that are in $\mathbf{V}_0$. Then, for each
marked node, remove all the edges and the non-marked or non-query nodes that
are on a path from this marked node, unless they are on a path from a query
node and this path does not contain the marked node. □

Assumptions. We assume that all the views and the rewritings of the new
queries over the views considered are among those that can be obtained from
the multiquery AND/OR dag $\mathcal{G}$. Further, consider a set $\mathbf{V}$ of new materialized
views and let $\mathbf{Q}^{\mathbf{V}}$ be a set of cheapest rewritings of the queries in $\mathbf{Q}$ over $\mathbf{V}_0 \cup \mathbf{V}$.
In computing the view maintenance cost of $\mathbf{V}$, we assume that a materialized
view that does not occur in $\mathbf{Q}^{\mathbf{V}}$ is not used in the maintenance process of another
view in $\mathbf{V}$.

Definition 3. Given $\mathcal{G}$ and $\mathbf{V}_0$, a goal state $s_g$ is a state such that there exists
a solution $\mathbf{V}$ satisfying the conditions:

(a) The non-marked sink nodes of $s_g$ are exactly the views in $\mathbf{V}$, and

(b) The cheapest rewritings of the queries in $\mathbf{Q}$ over $\mathbf{V}_0 \cup \mathbf{V}$ are represented in $s_g$.
The following theorem is a completeness statement for the state transformation
rules.

Theorem 1. Let $\mathcal{G}$ be a multiquery AND/OR dag for a set of new queries $\mathbf{Q}$,
and $\mathbf{V}_0$ be a set of old views. If there is a solution to the DW design problem, a
goal state for $\mathcal{G}$ and $\mathbf{V}_0$ can be obtained by applying in sequence a finite number
of state transformation rules to the initial state for $\mathcal{G}$ and $\mathbf{V}_0$. □

Search space definition. Viewing the states as nodes and the transitions be- 
tween them as directed edges of a graph, the search space is determined by the 
initial state and the states we can reach from it following transitions in all possi- 
ble ways. Clearly, the search space is, in the general case, a finite rooted directed 
acyclic graph which is not merely a tree. As a consequence of Theorem 1, there 
is a path in the search space from the initial state $s_0$ to a goal state $s_g$.

5 Algorithms 

In this section we present incremental algorithms for the dynamic DW design 
problem. Heuristics that prune the search space are provided in [11]. 

The cost and the size of a new state s' can be computed incrementally along
a transition $T(s, s')$ from a state s to s' [11]. The basic idea is that instead
of recomputing the cost and the size of s' from scratch, we only compute the
changes incurred to the query evaluation and view maintenance cost, and to the
storage space of s, by the transformation corresponding to $T(s, s')$.

Any graph search algorithm can be used on the search space to exhaustively 
examine the states (by incrementally computing their cost and size) , and return a 
goal state (if such a state exists). We outline below a variation of the exhaustive
algorithm guaranteeing a solution that fits in the allocated space, and a second
one that emphasizes speed at the expense of effectiveness.

A two phase algorithm. This algorithm proceeds in two phases. In the first 
phase, it proceeds as the exhaustive algorithm until a state satisfying the space 
constraint is found. In the second phase, it proceeds in a similar way but excludes 
from consideration the states that do not satisfy the space constraint. A two 
phase algorithm is guaranteed to return a solution that fits in the allocated 
space, if a goal state exists in the search space. In the worst case though, it 
exhaustively examines all the states in the search space. 
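
One possible reading of the two phase algorithm is sketched below in Python. This is an interpretation, not the authors' implementation: states are opaque hashable values, and `successors`, `cost` and `size` are hypothetical callables supplied by the caller (in the paper, cost and size are computed incrementally along transitions, which this sketch does not attempt). Breadth-first order is an arbitrary choice.

```python
from collections import deque


def two_phase(initial, successors, cost, size, t):
    """Return the cheapest state found that fits in space t, or None."""
    # Phase 1: explore exhaustively until some state satisfies the space constraint.
    seen, queue, start = {initial}, deque([initial]), None
    while queue:
        state = queue.popleft()
        if size(state) <= t:
            start = state
            break
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if start is None:
        return None
    # Phase 2: continue from that state, ignoring states that violate the constraint.
    best, seen, queue = start, {start}, deque([start])
    while queue:
        state = queue.popleft()
        if cost(state) < cost(best):
            best = state
        for nxt in successors(state):
            if nxt not in seen and size(nxt) <= t:
                seen.add(nxt)
                queue.append(nxt)
    return best


# Toy search space (hypothetical): s0 -> s1, s2 -> s3.
TP_SUCC = {"s0": ["s1", "s2"], "s1": ["s3"], "s2": ["s3"], "s3": []}
TP_SIZE = {"s0": 12, "s1": 9, "s2": 11, "s3": 4}
TP_COST = {"s0": 5, "s1": 7, "s2": 6, "s3": 8}
```

On the toy space, `two_phase("s0", TP_SUCC.__getitem__, TP_COST.__getitem__, TP_SIZE.__getitem__, 10)` skips `s0` (too large), starts phase two at `s1`, and keeps it because the only feasible successor `s3` is more expensive.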

An r-greedy algorithm. Exhaustive algorithms can be very expensive for a 
large number of complex queries. The r-greedy algorithm proceeds as follows: 
for a state considered (starting with the initial state) all the states that can be 
reached following at most r transitions are systematically generated and their 
cost and size are incrementally computed. Then, the state having minimal cost 
among those that satisfy the space constraint, if such a state exists, or a state 
having minimal size, in the opposite case, is chosen for consideration among 
them. The algorithm keeps the state $s_f$ satisfying the space constraint and hav-
ing the lowest cost among those examined. It stops when no states can be gen-
erated from the state under consideration and returns $s_f$. This algorithm is not
guaranteed to return a solution to the problem that fits in the allocated space,
even if a goal state exists.
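
The r-greedy strategy described above can be sketched in Python as follows. As before, this is a minimal illustration under assumptions: `successors`, `cost` and `size` are hypothetical callables over hashable states, costs are looked up rather than computed incrementally, and termination relies on the search space being a finite acyclic graph.

```python
def reachable_within(state, successors, r):
    """States reachable from `state` by following between 1 and r transitions."""
    frontier, found = {state}, set()
    for _ in range(r):
        frontier = {nxt for s in frontier for nxt in successors(s)} - found - {state}
        found |= frontier
    return found


def r_greedy(initial, successors, cost, size, t, r=1):
    """Greedy search; returns the best feasible state examined, or None."""
    current = initial
    best = initial if size(initial) <= t else None
    while True:
        candidates = reachable_within(current, successors, r)
        if not candidates:
            return best              # nothing can be generated: stop
        feasible = [s for s in candidates if size(s) <= t]
        if feasible:
            current = min(feasible, key=cost)     # cheapest state that fits
        else:
            current = min(candidates, key=size)   # otherwise the smallest state
        if size(current) <= t and (best is None or cost(current) < cost(best)):
            best = current


# Toy search space (hypothetical): s0 -> s1, s2 -> s3.
RG_SUCC = {"s0": ["s1", "s2"], "s1": ["s3"], "s2": ["s3"], "s3": []}
RG_SIZE = {"s0": 12, "s1": 9, "s2": 11, "s3": 4}
RG_COST = {"s0": 5, "s1": 7, "s2": 6, "s3": 8}
```

On the toy space with `t = 10`, the search moves from `s0` to `s1` and then to `s3`, and returns `s1` as the cheapest feasible state examined. With `t = 3` no state fits and the result is `None`, which illustrates why the algorithm offers no guarantee.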
6 Conclusion

In this paper we have dealt with the issue of incrementally designing a DW by 
stating and studying the dynamic DW design problem: given a set of old views 
materialized in the DW, a set of new queries to be answered by the DW, and 
extra space allocated for materialization, select a set of new views to materialize 
in the DW that fits in the extra space, allows a complete rewriting of the new 
queries over the materialized views and minimizes the combined evaluation cost 
of the new queries and the maintenance cost of the new views. A dynamic DW 
design process allows the DW to evolve smoothly in time, without interrupting its 
operation due to materialized view removal. We have modeled the dynamic DW 
design problem as a state space search problem, using a multiquery AND/OR 
dag representation of the new queries. Transitions between states are defined 
through state transformation rules which are proved to be sound and complete. 
We have designed generic incremental algorithms and heuristics to reduce the 
search space. Also shown is that this approach can be used for statically designing 
a DW. 
Abstract. The conventional star schema model of Data Warehouse (DW) has its
limitations due to the nature of the relational data model. Firstly, this model 
cannot represent the semantics and operations of multi-dimensional data 
adequately. Due to the hidden semantics, it is difficult to efficiently address the 
problems of view design. Secondly, as we move up to higher levels of summary 
data (multiple complex aggregations), SQL queries do not portray the intuition 
needed to facilitate building and supporting efficient execution of complex
queries on complex data. In light of these issues, we propose the Object-
Relational View (ORV) design for DWs. Using Object-Oriented (O-O)
methodology, we can explicitly represent the semantics and reuse view (class)
definitions based on the ISA hierarchy and the class composition hierarchies,
thereby resulting in a more efficient view mechanism. Part of the design
involves providing a translation mechanism from the star/snowflake schema to
an O-O representation. This is done by flattening the fact-dimension schema
and converting it to a class-composition hierarchy in an O-O framework.
Vertically partitioning this O-O schema further increases the efficiency of query
execution by reducing disk access. We then build a Structural Join Index 
Hierarchy (SJIH) on this partitioned schema to facilitate complex object 
retrieval and avoid using a sequence of expensive pointer chasing (or join) 
operations. 



1 Introduction 

In order to support complex OLAP queries and data visualization, data warehouses 
primarily contain historical, consolidated data. This multidimensional data is
represented in popular relational systems by a star/snowflake schema [RBS97]
[Mic95]. This framework also supports summary views on pre-aggregated data. 
Querying efficiency can be enhanced by materializing these views [GM95] and by 
building indexes on them. 
In [VLK98], we examined issues involved in developing the Object-Relational
View (ORV) mechanism for the data warehouse. Here, OR means an object-oriented 
front-end or views to underlying relational data sources. So, the architecture and 
examples we provide follow our interpretation of OR. It must be noted though, that 
the merits of this proposal can be applied to views in Object-Relational Databases 
(ORDBs) [CMN97] also. The layered architecture of the ORV is shown in Figure 1.
Fig. 1. A Layered Architectural Framework



In this framework, two models are captured; both have a multi-layer architecture,
consisting of wrapper/monitors, integrator and summarizing units. In the first model,
relational-to-O-O translation is done after database integration, hence the warehouse
data is built on the underlying integrated framework. In the second model, we perform
translation into the O-O model at the wrapper level, so that the canonical model for
the integrated schema is the O-O model, offering more flexibility in dealing with
diverse semantics of the underlying data [NS96].

The Complete Warehouse Schema (CWS) in both models contains Base classes 
(BWS) which include some directly mappable classes and some derived View classes 
(VWS) based on summarizing queries. Furthermore, views (Virtual classes) can be
inherited from this CWS. These views may be partially or completely materialized. 

Primary issues include translating the relational data structures into a class 
hierarchy, defining class structures for the summary views, supporting object ids for 
object instances of the views (classes) generated, handling those classes with respect 
to maintenance and providing links to other classes in the hierarchy, and accessing and
querying these view classes. 



1.1 Paper Contribution and Organization 

Based on the issues discussed in [VLK98], this paper puts forward an object- 
relational (OR) view approach to address the issues in data warehousing. More 
specifically, we devise a translation mechanism from the star/snowflake schema to an 
object-oriented (O-O) representation. We also identify some query processing
strategies utilizing vertical partitioning and SJIH techniques for complex queries on 
complex objects. 

The rest of the paper discusses how the above-mentioned issues can be tackled and 
presents our proposal to solve them. In section 2, we present an overview of our ORV 
approach. In section 3, we provide algorithms to translate a star/snowflake schema to 
an O-O model. In section 4, we refine the schema using some query processing 
strategies, and apply the SJIH in a data warehousing environment. Finally, we 
conclude in section 5 and highlight subsequent work. 



2 Overview of the ORV Approach 

We detail in this section the object-relational view (ORV) approach to data 
warehousing. As O-O data models are semantically richer, the ORV mechanism can 
explicitly state multi-dimensional data semantics. Also, as we shall see in the 
following sections, the ORV approach leads to better view design and easier 
maintainability. The ORV's support for query-centric indexing facilities also provides 
an improvement in query performance. Note that in this paper, we deal only with the 
part of the data warehouse that is translated from the relational to the O-O model. 



2.1 Motivating Example 

In a relational DW, the fundamental star schema consists of a single fact table, and a 
single denormalized dimension table for each dimension of the multidimensional data 
model. To support attribute hierarchies, the dimension tables can be normalized to 
create snowflake schemas, resulting in smaller dimension tables that lead to lower join 
cost and hence help in improving query performance and in view maintenance. We 
adapt an example from [CD97] to demonstrate the need for the ORV approach. 

Figure 2 shows a snowflake schema for a Sales DW with one fact table and 
dimension tables representing Time, Product, Customer, and Address hierarchies. 
OLAP queries could be posed on various predicates along a single hierarchy, as well 
as on predicates along multiple hierarchies. Summary tables could be defined along a 
predicate or set of predicates by separate fact tables and corresponding dimension 
table(s). These summary tables could be materialized depending on various 
materialization selection algorithms to improve querying cost. As seen in the figure, 




Fig. 2. A Sample Snowflake Schema 



the dimension tables in the snowflake schema (along with the schema for summary 
tables) are in a composition hierarchy. Hence they can be naturally represented as an 
object-oriented schema. Therefore, querying (join) costs on complex predicates 
along this snowflake schema should be analogous to the costs of querying by the 
pointer-chasing mechanism in an O-O framework. 

From [FKL98], we see that the Structural Join Index Hierarchy (SJIH) mechanism 
is far superior to pointer-chasing operations for complex object retrieval, especially 
in queries involving predicates from multiple paths. Experimental results [Won98] 
agree with the analytical results of this cost model. 



2.2 Methodology for ORV Design 

Our view design methodology depends partly on the type and pattern of queries that 
access the DW frequently. By incorporating these access patterns, we can form an 
efficient framework for answering popular queries. Note though that as the queries 
change, the O-O schema may require changes in terms of partitioning and indexing, 
but the underlying schema is fairly static because of embedded semantics. This 
implicit support of semantics also enables efficient retrieval of multiple query paths 
along the same dimension hierarchy. For example, in the Time dimension, multiple 
paths could be along the Week, Month, and Season compositions. These are supported 
by the Class Composition Hierarchy (CCH) framework as shown in figure 3. 

As shown in figure 4, we illustrate our methodology in three phases. Phase-2 and 
Phase-3 are repeated until the O-O schema and SJIH are optimized. These phases are 
explained in detail in the following sections. 

Fig. 3. The Time Hierarchy 

Fig. 4. The ORV Design methodology 



3 Translating from Snowflake to O-O Schema 

The fundamental star schema model consists of a single Fact Table (FT) and multiple 
Dimension Tables (DTs). This can be further subclassed as snowflake (normalizing 
along DTs), multi-star (normalizing along FTs), and combinations of multi-star and 
snowflake schema models. We illustrate our translation mechanism here on the single 
star/snowflake schema model. Note that a generic extension to include multi-star 
schema models can be easily derived, owing to the advantages of the O-O model stated 
in section 2. 



3.1 Star / Snowflake Schema 

A snowflake schema consists of a single Fact Table (FT) and multiple Dimension 
Tables (DT). Each tuple of the FT consists of a (foreign) key pointing to each of the 
DTs that provide its multidimensional coordinates. It also stores numerical values 
(non-dimensional attributes, and results of statistical functions) for those coordinates. 
The DTs consist of columns that correspond to attributes of the dimension. DTs in a 
star schema are denormalized, while those in snowflake schema are normalized 
giving a Dimension Hierarchy. A generalized view of the snowflake schema is 
presented in figure 5. 

Preliminaries 

Every tuple in the FT consists of the fact or subject of interest, and the dimensions 
that provide that fact. So each tuple in the FT corresponds to one and only one tuple in 
each DT, whereas one tuple in a DT may correspond to more than one tuple in the 
FT. So we have a 1:N relationship between each DT and the FT. 

Let the snowflake schema be denoted as SS. 

No. of FT = 1; No. of DT = x. 

We denote the relations between the FT and DTs as: 

$Rel(FT, DT_i) = R_i$ 

$1 \le i \le x$, where $x$ is the no. of DTs 




Fig. 5. Generalized view of a snowflake schema 

Fig. 6. Corresponding initial O-O schema 

Let the relations between DTs in a dimension hierarchy be denoted as: 

$Rel(DT_i^{r}, DT_i^{r'}) = R_i^{r}$ 

$0 \le r \le m$, where $m$ is the no. of relations in the hierarchy under $DT_i$, 
and $DT_i^{0} = DT_i$ 



Table 1. Elements of the Fact Table (FT) 

Element          Description 
$\{D_i.k\}$      Set of dimension keys, each corresponding to a Dimension Table (DT). 
                 $1 \le i \le x$, where $x$ is the no. of DTs. 
$\{m_j\}$        Set of member attributes. 
                 $0 \le j \le y$, where $y$ is the no. of attributes. 
$\{f_s\}$        Set of results of statistical functions. 
                 $1 \le s \le z$, where $z$ is the no. of function results. 

Table 2. Elements of the Dimension Table (DT) 

Element          Description 
$D_i.k$          Index of the DT. 
$\{a_j\}$        Set of member attributes. 
                 $0 \le j \le n$, where $n$ is the no. of attributes. 
$\{R_i^r.k\}$    Set of keys of relations which form its dimension hierarchy. 
                 $0 \le r \le m$, where $m$ is the no. of relations in the hierarchy 
                 under $DT_i$. 



3.2 Query-driven Methodology for OO Schema Refinement 

Our methodology intends to capture the hidden semantics behind a DW schema 
design by combining the star/snowflake schema information with the query type 
and pattern information. Frequent data warehousing queries can be thought of as being 
decomposed and categorized into the following form: 

$Q \rightarrow \{Q_1 \cup Q_2\}$ 

where $Q_1$ is the set of queries that would lead to vertical partitioning of the schema, 
and $Q_2$ is the set of queries that would induce horizontal partitioning of the schema. 

Based on this classification, we can refine the resultant schema in two 
complementary ways. Refinement-1, which involves $Q_1$, is covered immediately 
below, while Refinement-2, which involves $Q_2$, is covered in section 4. 

Refinement 1 - vertical partitioning 

In an OODB environment, vertical partitioning can be regarded as a technique for 
refining the OODB schema through utilizing the query semantics to generate a finer 
class composition hierarchy of any class. The refinement can be accomplished in a 
step-by-step manner, as shown below. 

We note that in terms of predicates accessed in the DTs, queries of type $Q_1$ can be 
defined as 

$Q_1 \rightarrow (DT_i^{r}.\{a_j\})$, where $\{a_j\}$ is a set of attributes of $DT_i^{r}$. 

Step V1. For the Fact Table FT in the snowflake schema, create a class $C_0$ in the O-O 
schema. 

Create $C_0$ 

Step V2. For each Dimension Table $DT_i$ in the snowflake schema, create a class $C_i$ in 
the O-O schema. 

$\forall DT_i$: Create $C_i$ 

Step V3. For each relation $R_i$ in the snowflake schema, create a pointer to OID, $pOID_i$, 
in class $C_0$ in the O-O schema. 

$\forall R_i$: Create $C_0.pOID_i = OID(C_i)$ 

Step V4. For each member attribute $m_j$ in FT in the snowflake schema, create an 
attribute $m_j$ in class $C_0$ in the O-O schema. 

$\forall m_j$ in FT: Create $C_0.m_j$ 

Step V5. For each result-value attribute $f_s$ in FT in the snowflake schema, create an 
attribute $f_s$ in class $C_0$ in the O-O schema. 

$\forall f_s$ in FT: Create $C_0.f_s$ 

Step V6. For each relation $R_i^{r}$ in the snowflake schema, create a class $C_i^{r}$ in the O-O 
schema. 

$\forall R_i^{r}$: Create $C_i^{r}$ 

Step V7. For each member attribute $a_j$ in $DT_i^{r}$ in the snowflake schema, create an 
attribute $a_j$ in class $C_i^{r}$ in the O-O schema. 

$\forall i\ (\forall r\ DT_i^{r}.a_j$: Create $C_i^{r}.a_j)$ 

Step V8. For each relation $R_i^{r}$ in the snowflake schema, create a pointer to OID, 
$pOID_i^{r'}$, in class $C_i^{r}$ in the O-O schema. 

This is a recursive step, as it navigates through the dimension hierarchy. The 
relations between the various nodes of the DT are explicitly captured, so steps V6-V7 
can be repeated in the hierarchy loop. 

$\forall R_i^{r}.k$: Create $C_i^{r}.pOID_i^{r'} = OID(C_i^{r'})$ 

Step V9. For each query $Q_i$ in $Q_1$ that accesses a set of attributes $\{a_j\}$ belonging 
to a DT in $D_d$, vertically partition the corresponding class $C_d$ in the O-O schema. 

$\forall Q_i\ (\forall d\ DT_d.\{a_j\}$: Create $C_{dj} \leftarrow C_d)$ 
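
The translation steps can be illustrated with a short sketch. This is not the authors' implementation: the `Table` and `OOClass` types, the `translate` function, the `pOID_` naming, and the attribute names in the sample data are all assumptions made for the example. It creates one class per table (Steps V1, V2, V6), copies the attributes (Steps V4, V5, V7), and adds an OID pointer for every relation to another table (Steps V3, V8). Vertical partitioning (Step V9) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A fact or dimension table of a snowflake schema (hypothetical model)."""
    name: str
    attributes: list                               # member / result-value attributes
    children: list = field(default_factory=list)   # tables reached by a foreign key


@dataclass
class OOClass:
    """A class of the translated O-O schema (hypothetical model)."""
    name: str
    attributes: list
    pointers: dict = field(default_factory=dict)   # pointer name -> target class


def translate(table, classes=None):
    """Recursively map a table and the hierarchy below it to classes."""
    if classes is None:
        classes = {}
    if table.name in classes:          # a shared table is translated only once
        return classes
    cls = OOClass(table.name, list(table.attributes))
    classes[table.name] = cls
    for child in table.children:
        # One OID pointer per relation, then descend into the hierarchy.
        cls.pointers["pOID_" + child.name] = child.name
        translate(child, classes)
    return classes


# Fragment of the Sales example; attribute names are invented for illustration.
type_ = Table("Type", ["type_name"])
category = Table("Category", ["cat_name"], [type_])
product = Table("Product", ["prod_name"], [category])
customer = Table("Customer", ["cust_name"])
sales = Table("Sales", ["units", "dollar_sales"], [product, customer])

schema = translate(sales)
```

Here `schema["Sales"]` plays the role of $C_0$, and its `pointers` dictionary holds one entry per dimension table.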



3.3 Corresponding O-O Representation 



As seen in figure 6, the generalized view of the O-O schema is similar to that of the 
snowflake schema. The class corresponding to FT is $C_0$. 




Fig. 7. The OO Schema. 

The figure shows the class composition hierarchy for the Time dimension after refinement 1, 
and the is-a hierarchy (shaded area) for the Customer dimension after refinement 2. 



Figure 7 (without the shaded area) shows the translated O-O schema for the Sales 
example taken in previous sections, which is generated by tracing the steps of the 
above algorithm one by one. Note that this hierarchy is not a mere mapping of FTs 
and DTs from the snowflake schema. The classes mapping to the DTs are further 
vertically partitioned according to the queries acting on them. For an example of 
multiple paths within a single dimension hierarchy, let us consider the Time (Date) 
hierarchy. If the queries access Date by multiple paths like Day_of_Week, 
Day_of_Month, or Week_of_Quarter, they must be supported within the same path, 
instead of having to access disjoint entities (classes). 



4 Query Processing Strategies 

In this section, we enhance the O-O schema by further horizontally partitioning it in 
Refinement-2. We complement this partitioned schema with the Structural Join Index 
Hierarchy (SJIH) to facilitate complex object retrieval and avoid using a sequence of 
expensive pointer chasing (or join) operations. 



4.1 Refinement 2 - horizontal partitioning 

In terms of values of predicates accessed in the DTs, queries of type $Q_2$ can be 
defined as 

$Q_2 \rightarrow (DT_i^{r}.a_j.\{v_k\})$, where $\{v_k\}$ is a set of values of attribute $a_j$ of $DT_i^{r}$. 

Step H1. For each query $Q_i$ in $Q_2$ that accesses a record containing a set of values 
$\{v_k\}$ for attribute $a_j$ belonging to a DT in $D_d$, horizontally partition the 
corresponding class $C_d$ in the O-O schema. 

$\forall Q_i\ (\forall d\ DT_d.a_j.\{v_k\}$: Create $C_{djk} :: C_d)$ 

This forms the is-a hierarchy of the O-O schema. Here, the classes mapping to the 
DTs are further horizontally partitioned according to the queries acting on them. This 
subclassing ensures that specialized classes are available while maintaining a high 
degree of reusability. 
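
As a rough illustration of how horizontal partitioning yields an is-a hierarchy, the sketch below specializes a `Customer` class into a `Teenager` subclass. The attribute names, the `specialize` helper, and the age range used as the partitioning predicate are assumptions of this example; the paper does not define them.

```python
class Customer:
    """Class mapped from the Customer dimension table (attributes assumed)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


class Teenager(Customer):
    """Specialization induced by queries restricted to teenage customers."""


def specialize(customer):
    """Place a customer in the Teenager partition when the predicate holds.

    The range 13..19 is an assumption of this sketch.
    """
    if 13 <= customer.age <= 19:
        return Teenager(customer.name, customer.age)
    return customer


customers = [Customer("Ann", 15), Customer("Bob", 42)]
partitioned = [specialize(c) for c in customers]
teenagers = [c for c in partitioned if isinstance(c, Teenager)]
```

Because `Teenager` inherits from `Customer`, any code written against `Customer` still works on the partition, which is the reusability the subclassing is meant to preserve.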

Corresponding O-O representation: The class analogous to FT is $C_0$ (Sales) as 
noted in sec. 3, and the algorithm can be extended for multi-star schema by 
partitioning $C_0$ to obtain the CCH and is-a hierarchies. As seen in fig. 7, the resultant 
O-O schema after Refinement-2 contains specializations for the Customer class (the 
shaded area). This schema is the CWS, over which summary views may be built. 



4.2 Indexing - SJIH 

The Structural Join Index Hierarchy (SJIH) [FKL98] is a comprehensive framework 
for efficient complex object retrieval in both forward and reverse directions. It is a 
sequence of OIDs which provides direct access to component objects of a complex 
object, possibly through different paths. We use the SJIH on our ORV schema, and 
extend its applicability from class composition hierarchies to also include is-a 
hierarchies, thereby encompassing the Complete Warehouse Schema (CWS) in ORV. 
The total cost of the SJIH framework can be broadly categorized as storage cost, 
index retrieval cost and index maintenance cost. In this paper we also incorporate 
query-centric information including selectivity to determine the selection of forward 
and backward paths (> 2 paths) during creation (storage) and retrieval of SJIH. 

As shown in figure 8, the SJIH is built on objects from first level classes Sales, 
Product, Customer, and Date (Year) in the class composition hierarchy. As class 
Customer is subclassed as Teenager, the SJIH now involves the is-a along with the 
CC hierarchies. The implicit link provided by the O-O system between classes and 
their subclasses provides the link from the complex object (Sales) to the 
specializations (Teenager) of its component objects (Customer). This SJIH can utilize 
the is-a link along with the CCH links already exploited by previous works. 

The storage and retrieval costs involve determination of an optimum traversal path 
between the objects. In such a case, n paths will have to be traversed, of which one 
path would be in the forward direction and the other (n-1) would be in the reverse 
direction. The cost is proportional to the cardinality of the Join Index. 

According to the cost model shown in [Fun98], the cardinality of an SJI rooted 
at a class $C_i$ is given as: $n = \|C_i\| \times MF(C_i)$, where MF is the Multiplying Factor. 

In general, MF is given as: $MF(C_i) = K_i \times OP\big(fo_{i,j} \times MF(C_j)\big)$ 
where $C_j$ is a child class of $C_i$; $K_i$ is a constant depending on the degree of 
sharing/forward fan-out between the root and its shared sub-classes; and OP is either 
the max or the product of the forward fan-out values, depending on whether the 
pair-up between classes is constrained or unconstrained. 
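
A small numeric sketch may help make the recursion concrete. It assumes the reading $MF(C_i) = K_i \times OP(fo_{i,j} \times MF(C_j))$ over the child classes $C_j$, with MF = 1 for a class without children, a single constant K shared by all classes, and OP taken as max for a constrained pair-up and as the product otherwise. These are assumptions of the sketch rather than details confirmed by the text; the function names are hypothetical, and the fan-out values are the Year→Mon and Mon→Date entries of Table A in the Appendix.

```python
from math import prod


def multiplying_factor(cls, refs, k=1.0, constrained=False):
    """MF(C_i) = K * OP(fo_ij * MF(C_j)) over the child classes C_j.

    refs maps a class name to a list of (child name, forward fan-out) pairs.
    A class without children gets MF = 1 (assumption of this sketch).
    """
    children = refs.get(cls, [])
    if not children:
        return 1.0
    terms = [fo * multiplying_factor(child, refs, k, constrained)
             for child, fo in children]
    op = max if constrained else prod
    return k * op(terms)


def sji_cardinality(cls, size, refs, k=1.0, constrained=False):
    """n = ||C_i|| * MF(C_i)."""
    return size * multiplying_factor(cls, refs, k, constrained)


# Year -> Mon -> Date path: 10 years, 12 months per year, 30 dates per month.
refs = {"Year": [("Mon", 12)], "Mon": [("Date", 30)]}
n = sji_cardinality("Year", 10, refs)   # 10 * (12 * 30)
```

With a single child per class, max and product coincide, so the choice of OP only matters once a class has several children.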

A sample query, query parameters and proposed indexes are illustrated in the 
Appendix. By following the heuristic hill-climbing algorithm provided in [Fun98], we 
find that the optimal SJIH for the given query would be the second one which 
involves the classes Sales (S), Teenager (T), Product (P), and Year (Y), which incurs 
the least total number of page accesses (cf. Table B of Appendix). 



5 Conclusion 

The O-O model is extremely flexible and provides support at all stages for OLAP 
operations on a Data Warehouse. The ORV model can provide an intuitive and efficient 
framework for a DW. In this paper we have shown our methodology to achieve an 
efficient transformation between the relational DW and an O-O DW, and devised 
efficient query retrieval using the SJIH. We are currently conducting analytical and 
experimental studies to show the benefit of partitioning for retrieval. The preliminary 
results are quite encouraging. Subsequent work involves a study of other indexing 
aspects such as storage cost and maintenance cost. The support of the O-O model for 
dynamic schema changes will help in maintaining the DW during updates and also 
during the occasional structural changes. Metadata handling and indexing with 
dynamic reclassification of objects and OID manipulation is also an interesting field 
of investigation. To support our model, a classification of DW benchmarking queries 
is also being investigated, with effects on view design. A prototype system is 
currently being developed to address many of these challenging issues and to 
demonstrate the effectiveness of the ORV approach to data warehousing. 



Appendix 



Sample query Q: Total Sales of Products in {P_SET}, to Teenagers, grouped by Product and by Year. 
Assumptions 

• {P_SET} contains 50% of all Products. 

• 80% of Sales to Teenagers consist of Products in {P_SET}. 

• 20% of Customers are Teenagers. 

Table A. Query Parameters 

Reference (i→j)     fo     R      ||C_i||    ||C_j|| 
Sales→Product       1      100    50M        .5M 
Sales→Customer      1      50     50M        1M 
Sales→Teenager      1      250    50M        2M 
Sales→Date          1      500    50M        36.5K 
Prod→Category       1      10     .5M        1K 
Product→Retailer    50     100    .5M        50K 
Category→Type       100    5      1000       10 
Retailer→City       1      4      50K        12.5K 
Customer→City       1      80     1M         12.5K 
Year→Mon            12     1      10         120 
Mon→Date            30     1      120        3.6K 
Year→Date           365    1      10         3.6K 
Country→State       25     1      10         250 
State→City          5      1      250        1.2K 
Country→City        125    1      10         1.2K 

Table B. Disk I/O cost for Q 

No    Type of Index                                         No. of page accesses 
1     SJIH-1 (S, C, P, Y)                                   19532 
2     SJIH-2 (S, T, P, Y)                                   3907 
3     SJIH-3 (S1, S2); S1(S, T, P); S2(S, Y)                12113 
4     SJIH-4 (S3, S4, S5); S3(T, P); S4(S, T); S5(S, Y)     12892 


Notes: 

• The Sales class is the root and contains the 'value' desired in most queries, viz. '$ sales' or 
'units'. 

• Since each Sales object can appear in only one object path, i.e. no Sales object is 
shared by any two objects in the same class, the maximum cardinality of any SJI involving 
Sales is equal to the cardinality of the Sales class. So MF calculations using the degree 
of sharing are not required. 

• The duplicate factor in the SJI is ignored here, as we're not concerned with managing 
object deletions. 

• Only one Product is sold in a Sale. 



Abstract. To simplify the task of constructing wrappers/monitors for the 
information sources in data warehouse systems, we provide a modularized 
design method that reuses code. By substituting some of the wrapper 
modules, we can reuse the wrapper on a different information source. 
For each information source, we also develop a toolkit to generate a 
corresponding monitor. With this method, we can greatly reduce the effort 
of coding the monitor component. We also develop a method to map an 
object-relational schema into a relational one. The mapping method helps 
us build a uniform interface between a wrapper and an integrator. 



1 Introduction 

In contrast to the on-demand approach (extract data only when processing 
queries) of traditional databases, data warehouse systems provide an in-advance 
approach (data of interest are retrieved from information sources in advance). 
Because processed information is stored in the data warehouse, inconsistency 
may arise between the data warehouse and the underlying information sources. 
According to the WHIPS (WareHouse Information Project at Stanford) architecture 
[HGMW+95], we can use Monitor/Wrapper components to detect modifications 
in information sources and to maintain the consistency between information 
sources and the data warehouse system. 

A Monitor/Wrapper is specific to its underlying information source, so we code 
different Monitors/Wrappers for different information sources. Rewriting the 
Monitor/Wrapper for each information source is costly. We can divide the 
wrapper into several modules. When new protocols are applied or new 
information sources are added, we can substitute some modules and reuse others 
to construct a Wrapper/Monitor rapidly. 

This paper focuses on how the modularized design is applied on a wrapper, 
and discusses how to solve the mismatch between the query processor and the 
information source. 

The remainder of the paper is organized as follows. In section 2, we give an 
overview of related work. In section 3, we propose the architecture and modules 
for designing a wrapper/monitor. We show some examples in section 4 and conclude 
in section 5. 



2 Related Work 

The goal of WHIPS [HGMW+95] is to develop algorithms and tools for the 
efficient collection and integration of information from heterogeneous and 
autonomous sources, including legacy sources. There are three main components 
in the WHIPS architecture: data warehouse, integrator, and monitor/wrapper. 

The data warehouse stores integrated information available for applications. 
The integrator receives update notifications sent by the monitor. If an update affects 
integrated information in the data warehouse, the integrator must take appropriate 
actions, including retrieving more information from information sources. The 
monitor component detects the modifications applied to the information source. These 
modifications are passed to the integrator module. The wrapper component 
translates queries issued by the query processor from the internal representation used 
by the data warehouse system to the native query language used by information sources. 

The goal of the TSIMMIS [CGMH+94] project is to develop tools that 
facilitate the rapid integration of heterogeneous information sources. The 
TSIMMIS project uses a common information model (Object Exchange Model, OEM 
[PGMW95]) to represent the underlying data. Translators in TSIMMIS convert 
queries expressed over OEM into the native query language, and convert the 
results into OEM. 

The University of Maryland has proposed an architecture of an Interoperability 
Module (IM) to process queries on heterogeneous databases [Chang94,CRD94]. 
The IM resolves the conflicts among different databases by two kinds of 
parameterized canonical representations (CR). [Chang94] proposes two kinds of 
parameterized canonical form to resolve two kinds of heterogeneity: differences 
in query language and differences in schema, respectively. 



3 System Design and Implementation 

Fig. 1. System Architecture 

Fig. 2. System Initialization 

3.1 System Architecture 

(A) Functions of Each Module 

We first briefly describe the functions of each module in our system architecture, 
shown in Figure 1. Driver is responsible for retrieving data requested by other 
modules. Converter resolves the representation conflicts between the information 
source and the wrapper/monitor. Modification Retriever detects the changes in the 
information source and propagates messages to notify the integrator. Packager 
transforms data from the internal form into a form recognizable by the integrator. 
Translator resolves the schema conflicts between the information sources and 
the integrator. 



(B) The Interaction between Monitor/Wrapper and Integrator 

System Initialization. We initialize the wrapper of each information source 
when the integrator is started (Figure 2(a)). The wrapper starts up the monitors 
which belong to the information source (Figure 2(b)). The wrapper notifies 
the integrator what relations it handles, and the monitors send the corresponding 
relation schemas to the integrator. All schema information is registered at 
the integrator (Figure 2(c)). Finally, the monitor module checks for update messages. 
If a monitor finds any update, it notifies the integrator (Figure 2(d)). 

Fig. 3. Periodic Update Detection in Oracle 

Fig. 4. Query Arrived at Wrapper 



Periodic Update Detection. Every update is stored in a table 
before it is sent (Figure 3(a)). At the predetermined time, the monitor sends the 
updates to the integrator (Figure 3(b)) and cleans the table (Figure 3(c)). 

Then, when another update is applied to the relation (Figure 3(d)), the 
update is recorded in the table (Figure 3(e)). The update is sent to the 
integrator at the next predetermined time (Figure 3(f)). 

Query Arrived at Wrapper. When a query arrives at the wrapper component, 
the wrapper first notifies the monitor to detect the updates (Figure 4(a)). 
The monitor detects the modifications from the information source (Figure 4(b)) 
and sends the updates to the integrator (Figure 4(c)). The wrapper does not send 
the query to the information source until the update detection completes. 

When the monitor completes its detection, it notifies the wrapper to 
continue the query (Figure 4(d)). Then, the wrapper sends the query to the 
information source and gets the results (Figure 4(e)). Finally, the results are sent 
to the integrator (Figure 4(f)). 
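
The two interaction patterns described here, periodic flushing of buffered updates and update detection forced by an incoming query, can be sketched together. The sketch is illustrative only: the class and method names are hypothetical, the update table is modelled as an in-memory list, and the information source is replaced by a stand-in function. The paper's system is written in Java; Python is used here for brevity.

```python
class Monitor:
    """Buffers updates in a log table and forwards them to the integrator."""

    def __init__(self, send):
        self.log = []        # stands in for the update table at the source
        self.send = send     # callback that delivers a message to the integrator

    def record(self, update):
        """Called for every update applied to the monitored relation."""
        self.log.append(update)

    def flush(self):
        """Send the buffered updates, then clean the table."""
        if self.log:
            self.send(("updates", list(self.log)))
            self.log.clear()


class Wrapper:
    """Forwards a query only after update detection has completed."""

    def __init__(self, monitor, execute, send):
        self.monitor = monitor
        self.execute = execute   # callable that runs a query on the source
        self.send = send

    def handle_query(self, query):
        self.monitor.flush()                          # Figure 4(a)-(d)
        self.send(("result", self.execute(query)))    # Figure 4(e)-(f)


inbox = []                                   # messages seen by the integrator
monitor = Monitor(inbox.append)
wrapper = Wrapper(monitor, lambda q: "rows for " + q, inbox.append)

monitor.record(("insert", "emp", 1))
monitor.flush()                              # a predetermined flush time
monitor.record(("delete", "emp", 1))
wrapper.handle_query("SELECT * FROM emp")    # pending update is sent first
```

The ordering inside `handle_query` is the point of the sketch: the integrator always receives pending updates before the query result that may depend on them.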



3.2 The Design of Each Module 

In this subsection, we introduce the design of each module in the system architecture. 

(A) Packager 

The internal data type in the wrapper and monitor is defined for efficiency or 
simplicity, but it may differ from the type used between the integrator and packager. 
The packager module is responsible for transforming the internal data type into one 
that the integrator can recognize. 

(B) Modification Retriever 

The modification retriever module retrieves data from the converter, filters the 
data of interest, then asks the packager to propagate these data to the integrator. 
When the module works with a non-cooperative information source, it can use other 
detection methods to achieve update detection. 

(C) Translator 

The translator module provides mappings of query language and schema between 
the integrator and the information source. Among heterogeneous systems, coding the 
translator module is the main job when a wrapper is developed. 

(D) Converter 

The converter module resolves conflicts in data representation. For example, 
databases may use different representations of the same DATE data, e.g., 
'1975-03-19', so the module takes responsibility for transforming it into the style 
that the integrator uses, e.g., '19-MAR-75'. The converter serves not only the 
Wrapper but also the Monitor, because detection information also needs to be 
transformed to the style used by the integrator. 
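
The DATE example can be made concrete with a minimal sketch. The function name `to_integrator_date` is hypothetical, and the two formats are simply the ones quoted in the text; an explicit month table is used so the result does not depend on the locale.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def to_integrator_date(text):
    """Convert a 'YYYY-MM-DD' string to the 'DD-MON-YY' style."""
    d = date.fromisoformat(text)
    return "{:02d}-{}-{:02d}".format(d.day, MONTHS[d.month - 1], d.year % 100)


converted = to_integrator_date("1975-03-19")
```

A two-digit year loses the century, so a real converter would need to know which century the integrator assumes.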

(E) Driver 

The driver module processes the query forwarded by the converter module. The 
driver module should be provided by the information source vendor, or be coded by 
the programmer in the worst case. The interface of a database driver tends to be 
unified or to use a multi-tier architecture. This makes it easier to develop a 
Converter module quickly. 

(F) Miscellaneous Modules 

The Toolkit uses a description file, which describes the schema of the 
information source, to generate the corresponding monitor. We can rapidly develop 
a monitor by using the toolkit. For different information sources, we should 
develop corresponding toolkits to match their demands. 

Metadata provides the processing information about the information source 
needed to execute queries, transform data, and retrieve data. It may be embedded in 
the code of each module. 



3.3 Implementation Tool and Environment 

The whole system is implemented in the Java language. We distribute the 
Java-to-Java applications using Remote Method Invocation (RMI). WHIPS uses 
CORBA (ILU) to hide the low-level communication. Java Database Connectivity 
(JDBC) is an access interface for relational databases. It provides a uniform way 
to access different relational databases from Java. 

In this paper, we use two database systems, PostgreSQL and Oracle, as the 
information sources when we implement our system. In the early implementation, 
we used PostgreSQL 6.2 and Oracle7 on Solaris. Later in our implementation, we 
used PostgreSQL 6.3 on FreeBSD and Oracle8 on Solaris. 



3.4 Detailed Implementation of Each Module 

(A) Driver Module 

We use the JDBC driver provided by DBMS vendors as the driver module. 



(B) Converter Module 

The main job of the converter module in our approach is to resolve the conflicts in 
data representation. When a query arrives at the converter module, we use a parser 
to find the conflicts between different data formats and convert them to the one 
that the information source uses. Then the translated query is sent to the driver 
module. We provide the same interface as the Translator module, so the integrator 
can communicate directly with the converter. 



(C) Packager Module 

The packager module in our approach is responsible for translating the internal 
data into strings. The reason for using strings as the data type is simple: the 
integrator component can directly form a query by using these strings and transform 
them into another type. 

Besides, each object has a toString() method in Java, so the transformation 
can be directly applied to every object. We can also define our own class type, which 
overrides the method, to support new data types. 

(D) Modification Retriever Module 

For cooperative information sources, such as Oracle, we use triggers to record 
the modification information in another table, and periodically retrieve data 
from this table to notify the integrator. For non-cooperative information sources, 
such as PostgreSQL, we use a snapshot algorithm to retrieve the modifications. In 
our approach, the source code of this module is generated by the toolkit module. 
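
For the non-cooperative case, the idea of detecting modifications from snapshots can be sketched as a comparison of two copies of a relation keyed by primary key. This is a naive comparison written for illustration, not the specific snapshot algorithm used in the paper; the function name and the sample rows are hypothetical.

```python
def snapshot_diff(old, new):
    """Compare two snapshots (dicts keyed by primary key) and return changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates


before = {1: ("Ann", "R&D"), 2: ("Bob", "Sales")}
after = {1: ("Ann", "Sales"), 3: ("Cid", "R&D")}
inserts, deletes, updates = snapshot_diff(before, after)
```

The trigger-based approach avoids this full comparison because the source itself records each change as it happens.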

(E) Toolkit Module 

If there are a lot of tables in one database, it is very helpful to use the toolkit 
to generate the modification retriever module. This module takes the description file 
and generates the corresponding source code of the modification retriever module. 

When the information source changes, we can modify the description file and 
regenerate the modification retriever quickly. For different information sources, we 
must develop corresponding toolkits to meet their different demands. 
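
The generation step can be pictured with a toy generator. The format of the description file is not given in the text, so the dictionary with `table` and `key` entries below is purely hypothetical, as are the template and the function names; the generated monitor is also far simpler than a real one.

```python
TEMPLATE = '''\
def monitor_{table}(old, new):
    """Generated monitor for relation {table} (key: {key})."""
    changed = [row for row in new if row not in old]
    return [("{table}", row["{key}"]) for row in changed]
'''


def generate_monitor(description):
    """Produce monitor source code from a table description."""
    return TEMPLATE.format(table=description["table"], key=description["key"])


source = generate_monitor({"table": "employee", "key": "emp_id"})

namespace = {}
exec(source, namespace)            # compile the generated source
monitor_employee = namespace["monitor_employee"]

old_rows = [{"emp_id": 1, "name": "Ann"}]
new_rows = [{"emp_id": 1, "name": "Ann"}, {"emp_id": 2, "name": "Bob"}]
changes = monitor_employee(old_rows, new_rows)
```

Changing the description and rerunning `generate_monitor` yields a new monitor without touching the template, which is the reuse the toolkit aims for.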

(F) Translator Module 

In our approach, we provide a relational schema to the integrator component. 
Therefore, we should map the schema between the underlying information source 
and the integrator. 

Oracle8 is an object-relational database system. When we use Oracle8 as one 
of the information sources, we must provide some mappings. In the following, we 
introduce these mappings. The direction of the mapping is from translator to 
integrator. 

OID 

We map an OID into a primary key. We need to identify the primary key to 
satisfy the demands of the Strobe algorithm. Oracle8 can offer an oid for each 
object, so we can directly use the oid as the primary key. 

CLASS 

We map the class into a relation. When the integrator retrieves data from the 
relation, we map the request to retrieving data from the class. A REF is an oid 
referencing another object. REFs are mapped into foreign keys that reference 
primary keys, i.e. OIDs. 

Set 

We map the set into another table. When queries are applied on the mapped 
relation, we can translate them into the native form. 



Relationship 



Association 

We map references into foreign keys. As discussed above, a REF is 
mapped into a join between two relations. 

Nested Attribute 

We map the nested attributes into another relation. We map the nested attribute 
into a foreign key which references an instance of the additional relation. 
A query on the additional relation is translated accordingly. 
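
The mappings above (oid to primary key, REF to foreign key, set and nested attribute to additional relations) can be summarized in a small sketch. Everything here is schematic: the `map_type` function, the tuple encoding of attribute kinds, the naming of the additional relations, and the column strings are assumptions of the example and not valid DDL for any particular DBMS. The attributes are those of the employee type used in the example of section 4.1.

```python
def map_type(type_name, fields):
    """Map one object type to relational tables (schematic column lists).

    fields maps an attribute name to a kind:
      "scalar"            -> ordinary column
      ("ref", T)          -> foreign key to the oid of T
      ("set", T)          -> separate table holding the set members
      ("nested", [attrs]) -> separate relation plus a foreign key to it
    """
    tables = {type_name: ["oid PRIMARY KEY"]}
    for attr, kind in fields.items():
        if kind == "scalar":
            tables[type_name].append(attr)
        elif kind[0] == "ref":
            tables[type_name].append(attr + " REFERENCES " + kind[1] + "(oid)")
        elif kind[0] == "set":
            tables[type_name + "_" + attr] = [
                "owner REFERENCES " + type_name + "(oid)",
                "member REFERENCES " + kind[1] + "(oid)",
            ]
        elif kind[0] == "nested":
            nested = type_name + "_" + attr
            tables[nested] = ["oid PRIMARY KEY"] + list(kind[1])
            tables[type_name].append(attr + " REFERENCES " + nested + "(oid)")
    return tables


employee = {
    "dept": ("ref", "department"),
    "supv": ("ref", "employee"),
    "position": ("nested", ["building", "city"]),
    "phone": ("set", "phone"),
}
tables = map_type("employee", employee)
```

The result contains three relations for the single employee type, which is why the translator has to rewrite relational queries over them back into the native object-relational form.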



4 Examples 

In our PDM example database [Chang96], Table 1 briefly shows how much effort we 
saved. The description file can be created through our toolkit interface; 
the toolkit then processes the file and generates the monitor component. As 
Table 1 shows, there are 1942 lines in total in the 6 monitors on PostgreSQL 
and 1322 lines in the 15 monitors on Oracle. In this example, the toolkit 
(200 lines) saves about 80% of the effort of coding our monitors (1322 lines) 
on Oracle, and the toolkit (629 lines) saves about 65% of the effort of coding 
our monitors (1942 lines) on PostgreSQL. The effort is saved mainly because there 
are many tables, even though the average monitor is shorter than the toolkit 
module. When there are only a few tables in the database, developing a toolkit 
may cost more than coding the monitors by hand. Coding a monitor on a 
non-cooperative information source (PostgreSQL) is more complex than on a 
cooperative one (Oracle). With our toolkit, all the details of a monitor can be 
ignored when we develop our monitor. The more tables an information source has, 
the more benefit we gain. The toolkit is also suited to situations in which the 
schema may change. 





                    Oracle        PostgreSQL 
Toolkit File        200 lines     629 lines 
Tables              15            6 
Generated Files     1322 lines    1942 lines 
Description Files   107 lines     41 lines 

Table 1. Generating Monitors with a Toolkit 
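
The savings quoted in the text can be recomputed from Table 1. The short sketch below is ours, not the authors'; the text does not say whether the description files are counted as hand-written effort, so both accountings are shown. Counting the toolkit file alone gives somewhat higher savings than the quoted figures, while adding the description files brings the results close to the quoted 80% and 65%.

```python
def saving(generated_lines, written_lines):
    """Fraction of hand-written code avoided by generating it."""
    return 1 - written_lines / generated_lines


# (toolkit file, description files, generated monitor code), from Table 1
table1 = {
    "Oracle": (200, 107, 1322),
    "PostgreSQL": (629, 41, 1942),
}

for source, (toolkit, description, generated) in table1.items():
    toolkit_only = saving(generated, toolkit)
    with_description = saving(generated, toolkit + description)
    print(f"{source}: {toolkit_only:.1%} (toolkit only), "
          f"{with_description:.1%} (toolkit + description files)")
```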



4.1 Query on the Object-Relational DBMS 

In this subsection, we show how the translator module processes SQL over an 
object-relational database, Oracle8. As we saw in Section 3, there are conflicts 
between the integrator and the database. The integrator sends a purely relational 
query to the translator module, but Oracle8 may use additional features that 
such a query cannot handle. 

Therefore, we provide pseudo tables to the integrator so that the semantics of 
the SQL applied on these tables can be easily retrieved. 

(A) The Schema 

The database contains four types: employee, department, location and phone 
type (Figure 5). 




Fig. 5. Mapping Schema 

Fig. 6. Communicate with Converter Module 



employee type 

dept refers to the department object in which the employee works; supv refers 
to the supervisor object, which is of employee type; position is a nested 
attribute containing building and city attributes; phone is a set of phone type. 

department type 

mgr refers to the manager object, which is of employee type. 

location type 

This type holds office information for employees. 

phone type 

The phone attribute is the phone number of an employee. 

(B) Queries Sent by Integrator 

Because the integrator uses a relational schema, we map the object-relational 
schema into a relational one (Figure 5). Queries based on the relational model 
are translated into a form that can be processed by the information source, 
which uses an object-relational model. By mapping the schema and translating the 
query, we provide pseudo tables and the ability to use SQL on these tables. 

(C) Directly Communicate with Converter Module 

If the integrator and the information source use the same data model, the 
integrator can communicate directly with the converter module. For example, if 
the integrator can handle the additional features of Oracle8, a query can be 
delivered to the converter module directly. We use another example to show how 
to achieve this. 

Consider the following query based on the relational model, which will be sent 
to the translator module. 

SELECT a.name, b.name 
FROM employee a, department b 
WHERE a.oid = b.mgr_fk 



The equivalent query for an information source based on the object-relational 
model is as follows. 

SELECT b.mgr.name, b.name 
FROM department b 



The query is then sent to the converter directly. That is, the integrator can 
communicate directly with the converter module and skip the translator module if 
the integrator and the information source use the same version of the database, 
e.g., Oracle8 (Figure 6). We currently use the JDBC driver for Oracle7, so only a 
subset of the new features in Oracle8 can be provided via the converter module. 
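
The relational side of this example can be tried out directly. The sketch below uses SQLite from the Python standard library rather than Oracle, and the sample rows are invented for illustration; it only shows that the join over the pseudo tables pairs each department with the name of its manager, which is what the path expression b.mgr.name expresses on the object-relational side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pseudo tables as seen by the integrator (sample rows are hypothetical).
    CREATE TABLE employee   (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE department (oid INTEGER PRIMARY KEY, name TEXT,
                             mgr_fk INTEGER REFERENCES employee(oid));
    INSERT INTO employee   VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO department VALUES (10, 'Sales', 1), (11, 'Research', 2);
""")

# The relational query sent by the integrator; ORDER BY is added only to
# make the result order deterministic.
rows = conn.execute("""
    SELECT a.name, b.name
    FROM employee a, department b
    WHERE a.oid = b.mgr_fk
    ORDER BY b.oid
""").fetchall()
print(rows)
```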

5 Conclusions 

In this paper, we first describe a modularized design for wrappers/monitors in 
relational databases. We also show the architecture and functions of each module. 
We demonstrate how a monitor component works on a non-cooperative information 
source as correctly as on a cooperative one. Besides the snapshot algorithm 
[LGM96], there are still other solutions to this problem. We create monitors and 
a wrapper for each site; every monitor is generated by the corresponding toolkit, 
which removes a lot of onerous work. Next, we use several examples to show the 
flow of message passing when the system is initialized, when an update occurs 
and when a query arrives. Because we use databases as the information sources, 
we can use transactions to maintain the order among updates and queries. The 
sequence number is sent to the integrator component, so the integrator can 
determine whether one event is earlier than another. Finally, we demonstrate how 
a translator module works with the integrator. The integrator can also 
communicate with the converter module directly, because both modules use the 
same interface. 

References 

[AK97] Naveen Ashish, Craig A. Knoblock: "Wrapper Generation for Semi-structured 
Internet Sources." SIGMOD Record 26(4): 8-15, 1997 

[Chang94] Yahui Chang: "Interoperable Query Processing among Heterogeneous 
Databases." University of Maryland technical report 94-67, 1994. 

[Chang96] Chih-Chung Chang, Amy J.C. Trappey: "A Framework of Product Data 
Management System - Procedures and Data Model." Master's thesis, Department 
of Industrial Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C., 
June 1996. 







[CGMH+94] Sudarshan S. Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly 
Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, Jennifer Widom: "The 
TSIMMIS Project: Integration of Heterogeneous Information Sources." In the 
Proceedings of IPSJ Conference 1994: 7-18 

[CRD94] Yahui Chang, Louiqa Raschid, Bonnie J. Dorr: "Transforming Queries from a 
Relational Schema to an Equivalent Object Schema: A Prototype Based on F-logic." 
In the Proceedings of the International Symposium on Methodologies in Information 
Systems 1994: 154-163 

[GRVB98] Jean-Robert Gruser, Louiqa Raschid, Maria Esther Vidal, Laura Bright: 
"Wrapper Generation for Web Accessible Data Sources." In the Third IFCIS 
Conference on Cooperative Information Systems (CoopIS'98) 1998. 
Also see ftp://ftp.umiacs.umd.edu/pub/louiqa/BAA9709/PUB98/CoopIS98.ps 

[HBGM+97] Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana 
Yerneni, Markus M. Breunig, Vasilis Vassalos: "Template-Based Wrappers in the 
TSIMMIS System." In the Proceedings of the Twenty-Sixth SIGMOD International 
Conference on Management of Data 1997: 532-535 

[HGMW+95] Joachim Hammer, Hector Garcia-Molina, Jennifer Widom, Wilburt 
Labio, Yue Zhuge: "The Stanford Data Warehousing Project." In the IEEE Data 
Engineering Bulletin 18(2): 41-48 (1995) 

[LGM96] Wilburt Labio, Hector Garcia-Molina: "Efficient Snapshot Differential 
Algorithms for Data Warehousing." Proceedings of VLDB Conference 1996: 63-74 

[LPTB+98] Ling Liu, Calton Pu, Wei Tang, Dave Buttler, John Biggs, Paul 
Benninghoff, Wei Han, Fenghua Yu: "CQ: A Personalized Update Monitoring Toolkit." 
In the Proceedings of the ACM SIGMOD, May, 1998. 
Also see http://www.cse.ogi.edu/DISC/CQ/papers/sigmod-demo.ps 

[PGGM+95] Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina, Jeffrey 
D. Ullman: "A Query Translation Scheme for Rapid Implementation of Wrappers." 
In the International Conference on Deductive and Object-Oriented Databases 1995: 
161-186 

[PGMU96] Yannis Papakonstantinou, Hector Garcia-Molina, Jeffrey D. Ullman: 
"MedMaker: A Mediation System Based on Declarative Specifications." In the IEEE 
International Conference on Data Engineering 1996: 132-141 

[PGMW95] Yannis Papakonstantinou, Hector Garcia-Molina, Jennifer Widom: 
"Object Exchange Across Heterogeneous Information Sources." In the IEEE 
International Conference on Data Engineering 1995: 251-260 

[Wid95] Jennifer Widom: "Research Problems in Data Warehousing." In the 
Proceedings of the 4th Int'l Conference on Information and Knowledge Management 
(CIKM) 1995: 25-30 

[WGLZ+96] Janet L. Wiener, Himanshu Gupta, Wilburt Labio, Yue Zhuge, Hector 
Garcia-Molina, Jennifer Widom: "A System Prototype for Warehouse View 
Maintenance." In the Proceedings of the ACM Workshop on Materialized Views: 
Techniques and Applications 1996: 26-33 

[ZGMH+95] Yue Zhuge, Hector Garcia-Molina, Joachim Hammer, Jennifer Widom: 
"View Maintenance in a Warehousing Environment." In the Proceedings of the ACM 
SIGMOD Conference 1995: 316-327 

[ZGMW96] Yue Zhuge, Hector Garcia-Molina, Janet L. Wiener: "The Strobe 
Algorithms for Multi-Source Warehouse Consistency." In the Proceedings of the 
Conference on Parallel and Distributed Information Systems 1996. 
Also see http://www-db.stanford.edu/pub/papers/strobe.ps 




Managing Meta Objects 
for Design of Warehouse Data 



Takao MIURA^1, Wataru MATSUMOTO^1 and Isamu SHIOYA^2 

^1 Dept. of Electrical and Electronical Engineering, Hosei University 
Kajinocho 3-7-2, Koganei, Tokyo, Japan 
^2 SANNO College, Kamikasuya 1563, Isehara, Kanagawa, Japan 



Abstract. In this work, we discuss issues about designing data in warehouses 
and make clear why meta objects are really important during the data warehouse 
design process. Our key ideas are deification and queries as objects. With these 
ideas we harmonize objects and meta objects seamlessly in the design process. 
We discuss an experimental prototype system called Harmonized Objects and Meta 
objects Environments (HOME). 



1 Motivation 

Recently much attention has been paid to data warehousing and online analytical 
processing (OLAP). A data warehouse is a subject-oriented, non-volatile, 
time-varying and integrated system for decision support in better and faster 
ways[1]. Relevant information is collected into repositories. Objects are data 
which describe some information of interest in the repositories. By meta-objects 
we mean knowledge of data in the repository, sometimes called data about data. 

In the database world, such information, called a scheme, has long been 
discussed in order to obtain classification rules for data and design 
methodologies. Note that data warehousing requires both traditional database 
processing and environments for database design processing. That is, all the 
processes are executed in a trial-and-error manner: database schemes change many 
times, and heavy queries are posed against database schemes to obtain design 
guidelines. Moreover we might need some techniques for scheme discovery[3]. In 
this work, we put an emphasis on manipulation of meta-objects to establish 
seamless operations between objects (instances) and meta-objects. 

Traditionally we have manipulated scheme contents by means of special mechanisms 
while data manipulation has been done by means of expressions over meta-objects. 
Thus users were forced to separate meta-objects from objects and to utilize them 
at different stages with different languages. More importantly, it is hard to 
obtain seamless manipulation of objects and meta-objects, such as querying 
meta-objects under given conditions over objects. The lack of seamless 
manipulation causes severe problems because in data warehousing we always 
manipulate objects and meta-objects equivalently. Especially in conceptual 
modeling of data warehousing we want to obtain views in terms of meta-objects by 
looking at the objects. 

In this investigation, we propose two basic ideas called reification and 
deification, originally developed for logic programming[9], by which we can 
manipulate objects and meta-objects seamlessly. Then we discuss an algebraic 
language extended for both meta-objects and query evaluation. 

In the next section, we discuss the basic ideas and the feasibility of the 
primitive features of our approach. Section 3 contains key considerations about 
our proposed language, putting stress on scheme management, evaluation of 
meta-objects and queries as values. Section 4 contains related works and we 
conclude our investigation in Section 5. 

2 MetaObjects and Scenario 

In this investigation we assume an object model for data modelling where every 
object e carries a finite set r(e) of types as its own intensional information. 
For tuples (also called associations, that is, relationships among objects) we 
define some classification rules called relation schemes R (or predicates) over 
some set A_1, .., A_n of types^1. We assume that every tuple <o_1, .., o_n> 
carries a finite set r(e) of relation schemes as its own intensional information, 
and that every tuple is consistent with the definition of relation schemes, i.e., 
every o_i has the type A_i. By instance we mean one of these primitives. 

Our basic ideas of this investigation come from a framework of scheme 
definitions and interaction among objects. First, we assume a core set of 
meta-objects to keep scheme information and also to describe themselves. By 
giving a scheme structure in advance, we could describe the exact meaning of 
meta-objects and their manipulation. 

Second, we introduce queries as values. This is because meta-objects play 
special roles when we evaluate them, that is, we can relate them to instances in 
warehouses by means of an evaluation mechanism for meta-objects. Similarly, 
queries correspond to sets of instances by the same mechanism; though they are 
not primitive symbols in schemes, queries can give the meaning of instances as 
if they were meta-objects. 

Third, we discuss the heart of this investigation, reification and 
deification[9]. Deification means an embodiment of meta-objects, that is, 
evaluation. By this technique we relate a meta-object m to a set of instances M, 
denoted by $m = M. For example, given a relation scheme R = R(A_1, .., A_n), $R 
means a relation (a set of tuples) r over A_1 x ... x A_n. For an attribute A_i 
in R, $Π_{A_i}(R) means the evaluated result of Π_{A_i}(R) while Π_{$A_i}(R) 
means the set of tuples in R of which values on A_i are evaluated again. For 
example, Π_{RelName}(RelationCatalog) becomes all the relation names while 
Π_{$RelName}(RelationCatalog) generates all the relations defined in the 
database. 

When we have a query Q, $Q means the result of the evaluation. For a tuple 
R(o_1, .., o_n), $R(o_1, .., o_n) is defined as TRUE if <o_1, .., o_n> is in $R 
and FALSE otherwise. For an object e, $e means NULL if it is a terminal symbol. 
Finally, given (meta)objects m_1, .., m_k, we define 
${m_1, .., m_k} = {$m_1, ..., $m_k}. 
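
The evaluation rules for $ can be illustrated with a small sketch. It is our own simplification, not the HOME implementation: the `database` dictionary, the `deify` and `project_first` helpers and the sample tuples are hypothetical, and RelationCatalog is reduced to a single RelName column.

```python
# Hypothetical toy database: relation name -> set of tuples.
database = {
    "RelationCatalog": {("RelationCatalog",), ("Student",), ("Enroll",)},
    "Student": {("MIURA", "Kawasaki"), ("MATSUMOTO", "Tokyo")},
    "Enroll": {("MIURA", "Databases", 90)},
}


def deify(m):
    """$m: evaluate a meta-object (here, a relation name) to its instances."""
    if isinstance(m, (set, frozenset, list)):
        # ${m_1, .., m_k} = {$m_1, .., $m_k}; a dict keeps the result readable.
        return {name: deify(name) for name in m}
    # Terminal symbols (anything that is not a relation name) yield NULL.
    return database.get(m)


def project_first(relation_name):
    """Project the first column of a stored relation."""
    return {t[0] for t in deify(relation_name)}


relation_names = project_first("RelationCatalog")  # all the relation names
all_relations = deify(relation_names)              # all the relations themselves
```

Projecting first and then deifying the result corresponds to evaluating the RelName values again, which is how the catalog of names leads to the relations defined in the database.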

Reification is a technique of abstraction to instances, just the opposite 
operator of deification. For an object e, we define ~e as its intensional 
information, i.e., r(e). Thus we might say ~R is RelName:name, ~R.A_i is 
AttrName:name, ~name is DomName:name and ~RelName is AttrName:name. In the case 
of tuples R(o_1, .., o_n), ~R(o_1, .., o_n) is defined to be R while 
~<o_1, .., o_n> is r(<o_1, .., o_n>). For a query expression Q, ~Q contains a 
set of query domains whose value types are QUERY in r(Q). Then we define 
~{m_1, .., m_k} = {~m_1, ..., ~m_k}. Here we don't discuss reification of sets 
of instances^2. 

^1 They are also called attributes, denoted by A_i : D_i, where A_i means a role 
in R on which we put an emphasis and D_i means some type. 

^2 Readers might ask how we should think about a set of instances obtained by 
evaluating a meta-object m or a query Q. Generally the set doesn't carry all the 
semantics within it, but we might mine them by means of knowledge discovery 
techniques such as intensional queries. We don't discuss this problem any 
further in this investigation; it remains open[3]. 

After this discussion, we introduce our algebraic language, where we assume the 
core part of the scheme structure within the language, i.e., it knows the scheme 
structure as common knowledge. Then operations over meta-objects can be defined 
by special (meta) semantics. We extend algebraic expressions with the $ and ~ 
notations. Thus we can specify queries over meta-objects as well as objects, and 
their interaction, by means of reification and deification. 

3 Managing MetaObjects 

In this section we describe the overall architecture of our approach to 
harmonize objects and meta-objects. First we show how we can manage meta-objects 
in our scheme structure. Next we talk about queries as values: how we describe 
and evaluate them. Then we introduce some devices for reification and 
deification and we define an extended relational algebra. We also show the 
feasibility of this language. 

3.1 Core Set of Scheme Structure 

We describe our scheme structure to make the exact semantics of the scheme 
clear. Our primary purpose is, in fact, to define our scheme within our own 
scheme structure. This means that, once we define the treatment of our special 
scheme (by some software), we can define every scheme structure in terms of our 
scheme structure. To do that, it is enough to show some materialization of the 
scheme structure which is consistent with our assumption. This is called the 
core set of the scheme structure. 

Remember we have 3 relations in our scheme: RelationCatalog, AttributeCatalog 
and DomainCatalog. Thus we expect that the RelationCatalog relation should 
contain 3 tuples which correspond to these 3 relation schemes. We assume the 
name domain has values 32 bytes long and the numeric count and type domains hold 
4-byte integers. Then, for example, RelationCatalog has 40-byte tuples because 
of 1 name domain and two count domains. In a similar manner, we have the 
specific DomName relation. 

[Table: contents of the DomName relation, not recoverable from the source] 

The AttributeCatalog relation should contain 11 tuples for the 11 attributes of 
the 3 relation schemes. 

[Table: contents of the AttributeCatalog relation, not recoverable from the source] 

Note the type values CHARACTER, INTEGER, TYPE are 1, 2, 3 respectively, 
describing what kind of values are held on this domain. In our prototype system, 
they are represented as string, integer and integer respectively. 
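
The 40-byte tuple size can be checked with a small sketch. The physical layout below is an assumption on our part (one 32-byte name followed by two 4-byte integers, without padding), and the two count values in the sample record are placeholders, since the source does not say what the counts hold.

```python
import struct

# Assumed layout: a 32-byte name followed by two 4-byte integers.
# "=" selects standard sizes with no alignment padding.
RELATION_CATALOG_TUPLE = struct.Struct("=32sii")

# Placeholder counts (3 and 40) are illustrative only.
record = RELATION_CATALOG_TUPLE.pack(b"RelationCatalog", 3, 40)
name, first_count, second_count = RELATION_CATALOG_TUPLE.unpack(record)
print(RELATION_CATALOG_TUPLE.size, name.rstrip(b"\0"))
```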



3.2 Queries as Values 

Queries can be seen as a kind of value, yet we can also evaluate them. Since we 
assume an (extended) relational algebra, readers might imagine algebraic 
expressions like "project [A,B] select [r] R(ABC)" as attribute values. It is 
worth noting that users have to define the scheme of these queries, that is, all 
the queries on an attribute must have compatible attributes (i.e., the same set 
of domains) in their output. For example, all the queries must be compatible 
with "A,B" on this query attribute in the above case. From the viewpoint of the 
users' definition, the attribute seems to have table values over the common set 
of domains. 

Let us discuss what relationship this feature has to meta-objects. Suppose we 
have a query Q with a name V, registered as a tuple <V, Q> in a View relation 
scheme over ViewName x ViewQuery, which is a meta-object in a database scheme. 
Then we can consider the evaluation of Q as if it were that of V. V is not 
really a relation scheme because it has no materialization in the database 
except through the View relation. 

Now we assume query values on some attribute that are not materialized. If we 
don't evaluate the queries, then readers can see the queries but not the table 
values. If we evaluate the queries, readers see the table values but not the 
queries. Both cases might happen since, in database design, we need a lot of 
view definitions and their results. Note that we do not talk about materialized 
nested relations; instead we calculate them dynamically through stored queries. 
That means a change to the underlying data is reflected in the table values 
immediately. The difference comes from when we evaluate the queries. This is the 
reason why we introduce deification. 

EXAMPLE 1 We assume a relation Student(Name, Address, Friend(Identifier, Home)) 
where Friend is a query attribute over name and address domains. Also we assume 
two other relations TV(FirstName, FamilyName, Address, Age, Sex) for TV idols 
and Sports(FamilyName, NickName, Job) for sports champions. 

Name       Address   Friend(Identifier, Home) 
MIURA      Kawasaki  "project [FirstName, Address] select [Sex="female", Age=20] (TV)" 
MATSUMOTO  Tokyo     "project [FamilyName, Address] select [Job="Sumo"] (Sports)" 

Note that, in the relation, there are two table values but they have compatible 
structure. That is, Identifier over name is compatible with FamilyName and 
FirstName. Similarly Home is compatible with Address. 

After evaluating the relation above, we will get the (virtual) table below: 

Name       Address   Friend(Identifier, Home) 
MIURA      Kawasaki  (table resulting from the TV query) 
MATSUMOTO  Tokyo     (table resulting from the Sports query) 

□ 

Since we don't want to distinguish objects from meta-objects, we allow any 
queries consisting of the two types of symbols in query expressions. As we have 
pointed out, here we need some mechanisms to evaluate meta-objects and queries. 
We do that by means of deification, denoted by the $ symbol. Relation schemes 
with $ mean sets of tuples, while attributes with $ correspond to the set of 
tuples of which the values on the attributes are evaluated individually. 

EXAMPLE 2 In our previous example of Student(Name, Address, Friend(Identifier, 
Home)), we obtain the relation containing query values if we evaluate Student, 
i.e., $Student(Name, Address, Friend). 

When we want to obtain the relation containing the table values in the Friend 
attribute, we evaluate the attribute first and then Student, that is, 
$Student(Name, Address, $Friend). □ 
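
The difference between the two evaluations in Example 2 can be sketched as follows. This is an illustration under our own assumptions, not the prototype's code: queries are stored as small dictionaries instead of algebra strings, all sample rows are invented, and Sports is given an Address column here so that the second stored query is well-formed.

```python
# Hypothetical sample data; only the relation shapes follow Example 1.
TV = [
    {"FirstName": "Aiko", "FamilyName": "Sato", "Address": "Tokyo",
     "Age": 20, "Sex": "female"},
    {"FirstName": "Ken", "FamilyName": "Ito", "Address": "Osaka",
     "Age": 25, "Sex": "male"},
]
SPORTS = [
    {"FamilyName": "Yamada", "NickName": "Yama", "Job": "Sumo",
     "Address": "Nagoya"},
]
RELATIONS = {"TV": TV, "Sports": SPORTS}


def query(attrs, cond, relation):
    """A stored query value: project [attrs] select [cond] (relation)."""
    return {"attrs": attrs, "cond": cond, "relation": relation}


def deify(q):
    """$q: evaluate a query value into a list of (Identifier, Home) pairs."""
    rows = RELATIONS[q["relation"]]
    selected = [r for r in rows
                if all(r[k] == v for k, v in q["cond"].items())]
    return [tuple(r[a] for a in q["attrs"]) for r in selected]


STUDENT = [
    ("MIURA", "Kawasaki",
     query(["FirstName", "Address"], {"Sex": "female", "Age": 20}, "TV")),
    ("MATSUMOTO", "Tokyo",
     query(["FamilyName", "Address"], {"Job": "Sumo"}, "Sports")),
]

# $Student(Name, Address, Friend): the Friend column still holds queries.
unevaluated = STUDENT
# $Student(Name, Address, $Friend): each Friend query is replaced by its result.
evaluated = [(name, addr, deify(friend)) for name, addr, friend in STUDENT]
```

Because the table values are computed when `deify` is called, a change to TV or Sports shows up the next time the Friend attribute is evaluated, without any stored nested relation to refresh.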

To define such domains, we introduce QUERY as one of the TypeValue values and we 
should define a domain whose TypeValue is QUERY. In our prototype system 
described in the next section, the value is described in a tagged structure like 
MIME (Multipurpose Internet Mail Extensions), thus it has a fixed, 
implementation-dependent length (our implementation uses 128). More important is 
that we have to extend scheme structures: how can we describe schemes of query 
values? Note that the core set remains unchanged since no query value appears 
there. 

Given query attributes such as Friend, we must have some expression in our 
scheme structure. First of all, an attribute Friend is defined over a special 
domain askfriend in AttributeCatalog. This means, in turn, that domain values 
can be obtained by evaluating the queries, whose definition is described as a 
(virtual) relation. In the above example, we have a Student.Friend entry in 
RelationCatalog and the relevant attributes in AttributeCatalog. This is 
sufficient for our purpose since a query domain may contain a query domain 
inside. 

EXAMPLE 3 Let us describe the Friend attribute in our scheme structure. We 
assume every address value is 32 bytes long. First of all we must have the 
definitions of Student information in RelationCatalog and AttributeCatalog. 

[Table: Student entries in RelationCatalog and AttributeCatalog, not recoverable from the source] 

Since Friend has an askfriend domain of QUERY values, RelationCatalog must 
contain a Student.Friend entry. In a similar manner, the AttributeCatalog 
relation should contain Identifier and Home attributes for the Student.Friend 
relation. 

[Table: Student.Friend entries in RelationCatalog and AttributeCatalog, not recoverable from the source] 

DomainCatalog now contains the askfriend and address entries in DomName, where 
the ValueType values are QUERY and CHARACTER respectively. □ 

3.3 Queries as Domains 

To go one step further, we introduce a query as domain. That is, we define a 
domain by a query. This is really useful because very often identical or very 
similar queries appear on query attributes. In this case we give the definition 
on the query attribute but do not store the queries as values. The idea comes 
from the fact that some values can be obtained from other attributes or 
relations; we define a view domain, whose domain is defined by a query but does 
not appear as materialized values in tuples. 

EXAMPLE 4 Assume we have a relation scheme StudentScore(Name, Address, 
Summary(Course, Score)) and another relation Enroll(Name, Course, Score). We 
want to classify the StudentScore relation by each student. 

Name       Address   Summary(Course, Score) 
MIURA      Kawasaki  "project [Course, Score] select [Name="MIURA"] (Enroll)" 
MATSUMOTO  Tokyo     "project [Course, Score] select [Name="MATSUMOTO"] (Enroll)" 

Looking at the query values, readers see the difference is just the Name 
condition. We define the Summary attribute by a query "project [Course, Score] 
select [Name="%1"] (Enroll)" where %1 means the value in the first position of 
this scheme. 

Name       Address   Summary: "project [Course, Score] select [Name="%1"] (Enroll)" 
MIURA      Kawasaki 
MATSUMOTO  Tokyo 

By means of deification, we obtain the StudentScore relation by $StudentScore, 
which contains query values. But $StudentScore(Name, Address, $Summary) now 
contains query results in the Summary attribute. □ 

The parameters should be one of the values in each tuple (described as %1, %2, 
...) of the relation scheme that defines the query domain, or the tuple itself 
(described as %0), but self reference is not allowed. We note this query domain 
supports the GROUP-BY feature of SQL, which is known to be really important in 
data warehousing[1]; in our case, no special syntax is introduced, only a query 
domain. 
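
A query domain with a positional parameter can be sketched as below. The code is illustrative only: `bind` and `summary` are hypothetical helpers, the Enroll rows are invented, and `summary` evaluates the bound query directly instead of parsing the algebra string.

```python
# Hypothetical sample data for Enroll(Name, Course, Score).
ENROLL = [
    ("MIURA", "Databases", 90),
    ("MIURA", "Logic", 85),
    ("MATSUMOTO", "Databases", 70),
]

# Domain-level definition, stored once rather than in every tuple.
SUMMARY_DEFINITION = 'project [Course, Score] select [Name="%1"] (Enroll)'


def bind(definition, tup):
    """Replace %1, %2, ... with the tuple's values (1-based positions)."""
    bound = definition
    # Substitute higher positions first so "%1" does not clobber "%10".
    for position in range(len(tup), 0, -1):
        bound = bound.replace("%" + str(position), str(tup[position - 1]))
    return bound


def summary(tup):
    """Evaluate the Summary query domain for one StudentScore tuple."""
    name = tup[0]  # the value that %1 refers to
    return [(course, score) for n, course, score in ENROLL if n == name]


student_score = [("MIURA", "Kawasaki"), ("MATSUMOTO", "Tokyo")]
for tup in student_score:
    print(bind(SUMMARY_DEFINITION, tup), "->", summary(tup))
```

Each StudentScore tuple thus obtains its own group of Enroll rows, which is the sense in which a query domain plays the role of a grouping.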

To define such domains, we introduce QUERYDOMAIN as a TypeValue and temporary 
domains querynnn (nnn = 000, 001, ...) as DomainCatalog entries. Given query 
domains such as Summary, an attribute Summary is defined over a special domain 
querynnn in AttributeCatalog, just as in the case of queries as values. But this 
time the domain name is querynnn, by which we can obtain the pointer to the 
definition of the query through QueryDefinition: 
QueryDefinition[nnn] = pointer to the definition. 

Note that our scheme structure is closed under the reification and deification 
operators. That means no other meta-objects except the core scheme are required. 
Thus we can define our scheme semantics within our framework. 



3.4 Extending Relational Algebra 

Let us define an extended relational algebra for the purpose of reification and 
deification. Here we discuss "select [cond] expr" (which means σ_cond(expr)), 
"project [attrs] expr" (Π_attrs(expr)), "join [cond] expr1 expr2" 
(expr1 ⋈_cond expr2), "union expr1 expr2" (expr1 ∪ expr2), "intersect expr1 
expr2" (expr1 ∩ expr2) and "difference expr1 expr2" (expr1 − expr2) over 
relation schemes R (or ...). In this work, in the cond part of the select 
operator, we discuss only boolean combinations of primitive conditions of the 
form Attr1 = Attr2 and Attr = "const". 

We extend the operators to sets of input expressions. For example, 
σ_cond({R_1, ..., R_n}) means {σ_cond(R_1), ..., σ_cond(R_n)}, 
${E_1, .., E_n} means {$E_1, .., $E_n}, and so on. 

We define $Q for a query Q as the evaluated results (sets of tuples) of Q. That 
is, for a relation scheme R, $R means the set r of tuples of R, denoted by [R]. 
And, for instance, $(select [cond] expr) means the result of applying $ to the 
result of σ_cond(expr). Note $$Q is legal if the result consists of relation 
names. 

In the case of project, $ plays a special role: "project [A,$B,$C] expr" means 
that, for each tuple t in $expr, we replace t by <t[A], $t[B], $t[C]>. That is, 
we evaluate the B and C values of each tuple from expr. For brevity, we will 
denote "project [A,$B,$C] R" by R(A,$B,$C). Similarly, in the cond part of 
select, we have special forms "$A = $B" and "A IN $B". This means that, for 
every tuple from expr of the select operation, we examine the evaluation results 
and compare for equality or membership. Note there are some more select 
conditions such as set equality (=), membership (IN) and set inclusion 
(CONTAIN). Also note that the results are a collection of relations that might 
be sent to other queries. We also define reification syntax in our algebra. 

EXAMPLE 5 Here are some example queries of our extended algebra. 

1. join [Student.Friend = Student.Friend] select [Address="Tokyo"] Student 
select [Address="Yokohama"] Student 

This query explores all the pairs of Tokyo Students and Yokohama Students who 
have identical queries on Friend. 

2. join [$(Student.Friend) = $(Student.Friend)] select [Address="Tokyo"] Student 
select [Address="Yokohama"] Student 

This query explores all the pairs of Tokyo Students and Yokohama Students who 
have identical sets of friends. 

3. select ["MIURA" IN ($Friend).Identifier] Student(Name, Address, $Friend) 

This query explores all the tuples that have "MIURA" as a friend. 

4. project [Name] $(project [RelName] select [AttrName="Name"] AttributeCatalog) 

We obtain all the Name values that appear in any relation. 

□ 
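
The fourth query of Example 5 mixes meta-objects and objects, and its three steps can be traced in a short sketch. The catalog contents and sample rows below are hypothetical, and AttributeCatalog is reduced to (RelName, AttrName) pairs for illustration.

```python
# Hypothetical catalog and data; AttributeCatalog rows are (RelName, AttrName).
ATTRIBUTE_CATALOG = [
    ("Student", "Name"), ("Student", "Address"),
    ("Enroll", "Name"), ("Enroll", "Course"),
    ("Sports", "FamilyName"),
]
DATABASE = {
    "Student": [{"Name": "MIURA", "Address": "Kawasaki"}],
    "Enroll": [{"Name": "MATSUMOTO", "Course": "Databases"}],
    "Sports": [{"FamilyName": "Yamada"}],
}

# Inner query: project [RelName] select [AttrName="Name"] AttributeCatalog
relation_names = {rel for rel, attr in ATTRIBUTE_CATALOG if attr == "Name"}

# Deification: replace each relation name by the relation it denotes.
relations = [DATABASE[name] for name in sorted(relation_names)]

# Outer query: project [Name], applied to every relation in the collection.
names = {row["Name"] for relation in relations for row in relation}
print(sorted(names))
```

The inner query ranges over meta-objects only, the $ operator turns its result into a collection of relations, and the outer projection is then applied to each member of that collection.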



4 Related Works 

Meta-knowledge refers to knowledge about its context. For instance, some notion 
could be described by formal logic where logical consequences and formal proofs 
are expressed by some meta-languages. But it is well known that provability in 
the meta-languages is undecidable, thus we have to abandon the solid foundation 
by means of formal logic. Still, we assume the usefulness of meta-knowledge from 
the viewpoint of various applications. 

The aim of meta-knowledge is to ease the interaction between users and knowledge 
processing. Such meta-knowledge has been investigated to drive the process of 
knowledge acquisition, automated deduction, problem solving, programming and so 
on (see [6] for the variety of investigations). Among others, in databases and 
data warehousing this knowledge plays an integral part, where meta-knowledge 
means database schemes, the common meaning of all database instances. Database 
design is nothing but the construction of meta-knowledge[2]. In data warehouses, 
meta-objects give a bridge between users and data in the repositories by 
relating meaning to the contents[4, 5]. Concept learning from databases is one 
of the major research topics at the intersection of databases and knowledge 
processing. This topic is becoming more active day by day and much attention is 
paid to it from the viewpoint of data mining methods (see [3] for more detail). 
When analyzing the description extensively, we might generate new schemes by 
utilizing meta-knowledge. Such a knowledge discovery process is called scheme 
discovery in databases. 

Our series of works shows how to obtain new database schemes that are suitable 
for current database instances. We have developed the theories from this point 
of view[7, 8]. 



5 Conclusion 

In this work, we have discussed how to manage data and data about data (or 
meta-objects) seamlessly by means of reification and deification. This could 
result in suitable treatment of repository management and meta-data management, 
which improves design and maintenance for data warehousing and general database 
processing. To do that, we developed queries as values and queries as domains, 
and then we proposed an extended algebra language. We have developed an 
experimental prototype system named HOME (Harmonized Objects and Meta-objects 
Environment). HOME consists of the kernel database system and user interfaces, 
and we have some application testbeds as well as data warehousing. Currently we 
are studying strategy control issues and optimization problems in the framework 
of HOME. 



Abstract. 

Slice&dice and drilling operations are key concepts for ad-hoc data analysis in state- 
of-the-art data warehouse and OLAP (On-Line Analytical Processing) systems. While 
most data analysis operations can be executed on that basis from a functional point of 
view, the representation requirements of applications in the SSDB 
(Scientific&Statistical DataBase) area by far exceed the means typically provided by 
OLAP systems. In the first part of the paper, we contrast the data analysis and 
representation approaches in the OLAP and SSDB field and develop a generalized 
model for the representation of complex reports in data warehouse environments. The 
second part of the paper describes the implementation of this model from a report 
definition, management and execution perspective. The research and implementation 
work was executed in the data warehouse project at GfK Marketing Services, a top- 
ranked international market research company. Various examples from the market 
research application domain will demonstrate the benefits of the work over other 
approaches in the data warehouse and OLAP domain. 



1 Introduction 

Data warehousing and OLAP (On-Line Analytical Processing) are two closely related 
key technologies to support the development of computer-aided management 
information systems. The aim of developing such systems dates back to the 1960s;
since then, a variety of approaches have appeared under different terms like
Management Information Systems (MIS; [1]), Executive Information Systems (EIS;
[5]), and Decision Support Systems (DSS; [7]). Data warehousing and OLAP are
considered to be the key to overcoming the two major drawbacks of earlier approaches
towards computer-aided management information systems:

• lack of access to integrated and consolidated data and 

• lack of adequate, intuitive data analysis methods and user interfaces. 

The first issue is addressed in data warehousing, where providing "subject-oriented,
integrated, time-varying, non-volatile" data for further analysis ([6]) is the target. The
second issue is the focus of modern OLAP systems, where operations like slicing,
dicing, and drilling are offered to the user in a table-oriented user interface metaphor
([3]).



The common logical ground of data warehouse and OLAP systems is a distinction
between quantifying and qualifying data. The former represent empirical data
collected in the field (e.g. sales figures of products in certain shops over a given 
period of time), while the latter are descriptive data necessary to assign a meaning to 
the quantifying data. In the example above, qualifying data would describe the 
products, shops and time periods in a way that enables various segmentations of the 
base data according to application-oriented criteria (e.g. sum of sales of a product 
category in a distribution channel). To allow for such segmentations, qualifying data 
are organized in so-called dimensions with classification hierarchies defined upon 
them. For example, in the product dimension, single articles may be classified into 
product groups, product groups into categories, and those into sectors. Figure 1 shows 
an example of some instances of the article, product group, and category classification 
in the product dimension. Note that secondary classification attributes (e.g. Brand, 
VideoSystem, AudioSystem) are assigned to specific instances of the primary 
classification (e.g. Video); also note that different instances of the primary 
classification scheme have different secondary classification schemes assigned. We 
will come back to this important observation later; more details on this topic can be 
found in [8]. 



[Figure 1: classification tree of the product dimension]

Category: Video
    Brand: Sony, JVC, Grundig, Sanyo
    Video system: Video8, Hi8, VHS-C, VHS, S-VHS
    Audio system: Mono, Stereo

Product group: Camcorder
    Brand: Sony, JVC, Sanyo
    Video system: Video8, Hi8, VHS-C
    Audio system: Mono, Stereo
    Viewfinder: BW, Color

Product group: VCR
    Brand: Grundig, Sony
    Video system: VHS, S-VHS
    Audio system: Mono, Stereo
    Video heads: 2, 4

Article     Brand     Video system   Audio system   Viewfinder   Video heads
TR-780      Sony      Hi8            Stereo         BW           -
TRV-30      Sony      Video8         Mono           Color        -
GR-AX200    JVC       VHS-C          Mono           BW           -
GV-500      Grundig   VHS            Mono           -            2
SLV-E800    Sony      VHS            Stereo         -            4


Fig. 1. Example of a product dimension classification 
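
The observation that different product groups carry different secondary classification attributes can be made concrete with a small sketch. The dictionaries below are a hypothetical in-memory encoding of part of Figure 1 (the names `GROUP_ATTRIBUTES`, `ARTICLES`, `classification_path` and `common_attributes` are illustrative assumptions, not part of any system described here).

```python
# Hypothetical encoding of part of the Figure 1 classification.
CATEGORY_OF_GROUP = {"Camcorder": "Video", "VCR": "Video"}

# Secondary classification attributes differ from product group to product group.
GROUP_ATTRIBUTES = {
    "Camcorder": ("Brand", "VideoSystem", "AudioSystem", "Viewfinder"),
    "VCR": ("Brand", "VideoSystem", "AudioSystem", "VideoHeads"),
}

# Two sample articles with the product group they belong to.
ARTICLES = {
    "TR-780": "Camcorder",
    "GV-500": "VCR",
}


def classification_path(article):
    """Return the primary classification (category, group, article)."""
    group = ARTICLES[article]
    return (CATEGORY_OF_GROUP[group], group, article)


def common_attributes(groups):
    """Attributes that are valid for every one of the given product groups."""
    sets = [set(GROUP_ATTRIBUTES[g]) for g in groups]
    return set.intersection(*sets)
```

Only the attributes returned by `common_attributes` can safely be used to segment data that spans both product groups; Viewfinder and VideoHeads are meaningful only inside their own group.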

With one or more classification hierarchies defined upon every dimension, data 
warehouse and OLAP data can now be visualized as a multi-dimensional data cube, 
where the cells hold the quantifying data (often called facts), while the qualifying data 
describe the axes of the cube and can be used for addressing individual cells or groups 
of cells (Figure 2). 

Fig. 2. Multi-dimensional data cube



With these basic data modeling concepts introduced, we can now switch to a 
reporting-oriented view. 



2 Multi-dimensional Data Analysis 



Whilst data warehousing is primarily targeted at providing consolidated data for 
further analysis, OLAP provides the means to analyze those data in an application- 
oriented manner. According to the underlying multi-dimensional data view with 
classification hierarchies defined upon the dimensions, OLAP systems provide 
specialized data analysis methods. The basic OLAP operations are: 

• slicing (reducing the data cube by one or more dimensions), 

• dicing (sub-selecting a smaller data cube and analyzing it from different 
perspectives) and 

• drilling (moving up and down along classification hierarchies). 



In the latter case, different instances of drilling operations may be distinguished: 



• drill-down: switching from an aggregated to a more detailed level within the
same classification hierarchy;

• drill-up: switching from a detailed to an aggregated level within the same
classification hierarchy;

• drill-within: switching from one classification to a different one within the same
dimension;

• drill-across: switching from a classification in one dimension to a different
classification in a different dimension.
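
The operations above can be illustrated on a toy cube. The following is a minimal sketch, assuming a cube stored as a dictionary from coordinate tuples (article, shop, month) to a sales figure; all data values and the function names `slice_cube`, `dice` and `drill_up` are invented for illustration.

```python
from collections import defaultdict

# Toy cube: (article, shop, month) -> sales. All values are invented.
CUBE = {
    ("TR-780", "Shop A", "1997-01"): 5,
    ("TR-780", "Shop B", "1997-01"): 3,
    ("GV-500", "Shop A", "1997-01"): 2,
    ("GV-500", "Shop A", "1997-02"): 4,
}

# One level of the classification hierarchy of the product dimension.
GROUP_OF = {"TR-780": "Camcorder", "GV-500": "VCR"}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the cell keys."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}


def dice(cube, allowed):
    """Keep the cells whose coordinates lie in the allowed value sets per axis."""
    return {k: v for k, v in cube.items()
            if all(k[axis] in values for axis, values in allowed.items())}


def drill_up(cube, axis, parent_of):
    """Aggregate one dimension to the next higher classification level."""
    result = defaultdict(int)
    for k, v in cube.items():
        result[k[:axis] + (parent_of[k[axis]],) + k[axis + 1:]] += v
    return dict(result)
```

In this sketch a drill-down is simply the return from the aggregated cube to the more detailed one, which requires that the detailed cells are still available.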



The results of these operations are typically visualized in cross-tabular form, i.e. 
mapped to a grid-oriented, two-dimensional layout structure. Higher-dimensional data 
are mapped to this two-dimensional layout by nesting different dimensions, which
conflicts with the concept of orthogonality of the dimensions, but may be tolerated in
the final data visualization step of an OLAP data analysis session. In Figure 3, the
application of a drill-down operation to a three-dimensional cross-tab grid is
graphically exemplified.

Fig. 3. Drill-down operation in a cross-tabular layout structure for three-dimensional data

For many years, cross-tabular report structures have proven to be an adequate means
for batch-oriented reporting, particularly in the SSDB (Scientific&Statistical
DataBase) area. In those systems, the process of "constructing" the final report structure
was of minor interest. In OLAP systems, however, user support for interactive data
analysis, i.e. navigation through the data cube by applying the OLAP operators 
described above, is of vital interest. Of the 12 OLAP rules defined by E.F. Codd ([4]), 
some directly relate to that area: 

Rule 1: multi-dimensional conceptual views
Rule 10: intuitive data manipulation
Rule 11: flexible report generation

Additional requirements for interactive data analysis systems are found in the 
literature on design rules for information systems ([10]). They include: 

• Information should be associated with its underlying definitions.

• Summary and detail information should be visibly separated from one another. 

Together, these rules and requirements lead to four basic criteria for
assessing the quality of the GUI design and the underlying user interaction principles in
OLAP systems:

• user interface and user interaction concept for data navigation and analysis via 
slice&dice techniques; 

• user interface and user interaction concept for data navigation and analysis via drill 
techniques; 

• visualization concept for user-defined drill reports; 

• visualization concept for heterogeneous report structures. 

In the next section, these criteria will be used to identify strengths and weaknesses of 
state-of-the-art data warehouse and OLAP systems. 



3 Report Functionalities in State-of-the-Art Data Warehouse and 
OLAP Systems 

Data warehouse and OLAP systems are nowadays offered in great variety in the
marketplace. All major database system providers have specific offers in their product
portfolio. In addition, there exist numerous independent software houses specialized 
in this domain. Thus, it does not make sense to evaluate the different systems 
individually; instead, the evaluation criteria mentioned in the previous section will be 
discussed in a generalized manner. 

The slice&dice functionality in current OLAP systems is mostly implemented using 
separate drag&drop and selection dialogues. This conflicts with Rule 10 of Codd’s 
OLAP rules, as the user is forced to use different contexts (e.g. report window, 
selection window) in order to complete the task of report specification. A better 
concept would be to manipulate directly the visual representation of the report object 
by drag&drop operations. The attributes used to specify the slice&dice functions 
would be ideally presented in a tree-like selection list, as they are typically organized 
hierarchically. 

Drill operations are offered to the users in the majority of current systems by selecting 
the drill anchor cell in the report object first and then by specifying the desired drill 
function from a context-sensitive menu. This approach is judged as quite straight- 
forward to use by most OLAP users. A functionally equivalent alternative would be to 
use the same concept as the one for slice&dice functions, i.e. selecting the target 
attribute for the drill operation from the tree-like attribute selection list and then 
dragging&dropping it onto the drill anchor cell. 

User-defined drill reports are visualized in OLAP systems either within the original 
report object or in a separate report object. The first alternative is clearly to be 
favored, as only then the user can relate the drill data directly to the original drill
anchor data. As a side remark, it should be noted that many systems, particularly 
those based on the MOLAP (Multi-dimensional OLAP) concept, only offer to 
traverse a pre-specified drill sequence, as aggregated data for drill operations are 
computed at session start. ROLAP (Relational OLAP) systems, on the other hand, 
typically allow for an ad-hoc specification of drill attributes, as aggregated data are 
computed dynamically during a session. 

A joint visual representation of structurally heterogeneous report objects is a widely- 
used concept in the SSDB application domain. This approach guarantees that user 
reports have a compact layout, thus maximizing the amount of information 
presentable in one visible portion, e.g. a computer screen. Handling heterogeneous 
report components in separate report objects is facilitated by concepts like tab folders 
known from spreadsheet programs in modern GUI environments, but many users still
prefer to "see everything immediately on one screen". This is particularly true for
typical MIS / EIS / DSS users mentioned in the introductory section. In Figure 4, an 
example of a heterogeneous report with multiple drill instances is shown. Its different 
components will be explained in more detail in the following section. 

Fig. 4. Heterogeneous report with multiple drill instances



In summary, the user interaction concept of most state-of-the-art data warehouse and
OLAP systems is primarily geared to computer-literate users who are familiar with
using context menus, multi-window systems, tab folders, and the like. For many other
typical users of such systems, a more WYSIWYG-like (What You See Is What You Get)
work style, where every operation is directly performed on the final report
object, would be more intuitive. Finally, the requirement of many users for extremely
compact, yet highly customizable information content within a report is not
sufficiently supported in most current systems, whereas Statistical Databases have offered
the required functionality for a long time ([14]).



4 Object Model for Complex Reports 

The report shown in Figure 4 is a real example from GfK, one of the largest market 
research companies in the world. The sample report contains a number of modeling 
challenges for any kind of reporting system:

• drillings with (e.g. case "27 INCHES") and without (e.g. case "21 INCHES")
replacement of the drill anchor

• multiple drillings for the same drill anchor (e.g. case "25 INCHES")

• cascading drillings for a drill anchor (e.g. case "27 INCHES")

• "OTHERS" as the computed aggregate of all elements not explicitly selected

• multiple report anchors at the top report level
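
Of these challenges, the OTHERS line is the easiest to illustrate in isolation. The following minimal sketch assumes a simple mapping from element names to values; the function name `with_others` and the sample figures are invented for illustration.

```python
def with_others(values, selected, label="OTHERS"):
    """Return one row per selected element plus one row aggregating the rest."""
    rows = [(name, values[name]) for name in selected]
    # Everything that was not explicitly selected is summed into a single line.
    rest = sum(v for name, v in values.items() if name not in selected)
    rows.append((label, rest))
    return rows


# Invented sample data: sales by brand.
sales_by_brand = {"Sony": 120, "JVC": 80, "Grundig": 40, "Sanyo": 10}
```

Because the OTHERS value depends on the selection, it has to be recomputed whenever the set of explicitly selected elements changes.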

Reports of this kind cannot be generated in a joint layout structure with current data 
warehouse and OLAP systems. The current solution at GfK holds individual 
descriptions for every single report line, specified in a code scripting language 
relating to VSAM file positions and code values. This approach is highly error-prone
due to the lack of semantic links between the individually defined report line
specifications. The number of report line specifications adds up to several tens of
thousands, which makes maintenance a high-cost effort.

As the system used at GfK has clearly reached its limits, a decision was made to 
implement a new report management system based upon data warehouse and OLAP 
technologies. During modeling sessions with a number of tool providers, it turned out 
that no current system covers all of the requirements mentioned above. Therefore, it 
was decided to develop a report object model on top of an existing data warehouse 
platform, which is then mapped to lower-level report objects provided with the 
selected data warehouse platform (in this particular case the DSS Suite from 
MicroStrategy). The report object model will be described in this section; some 
remarks on the implementation will be given in Section 6. 

In Figure 5, the report object model developed at GfK is shown in UML notation
(for details on UML, see [2] and [13]). The model adopts the structure of a statistical
table and distinguishes on a high level between headings and a global filter. The 
global filter describes which data are included in the analysis of the specified report. 
The top and side headings are first decomposed individually into their independent 
components (case (6) in the above example) and those, in turn, level by level 
according to the drill structures of the component. The heading level information class 
instances describe the facts or attributes used in the different report components. If the 
instance describes an attribute, a sub-selection of data elements may be specified in an
attribute filter class instance.

It should be noted that in the report object model cascading drillings are not specified
recursively in the heading level information business class, but mapped to a sequential
description within a heading level information business class instance (1). This allows
for a compact description of complete drill splits, i.e. if a drill is applied to all
instances of an attribute, the drills do not have to be specified instance by instance.
For the same reason, multiple drill anchors are specified not as separate instances of
the attribute business class, but are propagated to the attribute filter business class and
jointly modeled there. Also note that facts (original or derived quantifying data)
cannot be specified within a drilling. In other words, the fact structure of the report is
determined globally in the heading level information class, which may also be
interpreted as an inheritance of the anchor instance fact description to the drilling
children. This guarantees compatible fact structures across different drilling levels.
Finally, it should be mentioned that the computation of Totals and Others is also
specified individually for the different heading levels and drills within the report object
model.
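
The structure described above can be approximated in a few data classes. This is only a sketch under stated assumptions: the class and field names are hypothetical and do not reproduce the business classes of Figure 5, and the attribute names "ScreenSize" and "Sales" are invented. It merely shows the separation of global filter, headings, components, heading levels, attribute filters and sequentially stored drills.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AttributeFilter:
    # Sub-selection of attribute instances; several drill anchors are modeled jointly here.
    elements: List[str]


@dataclass
class HeadingLevel:
    name: str                                         # fact or attribute shown at this level
    is_fact: bool = False
    attribute_filter: Optional[AttributeFilter] = None
    drills: List[str] = field(default_factory=list)   # cascading drills kept as a sequence
    total: bool = False                               # compute a Total for this level
    others: bool = False                              # compute an Others aggregate


@dataclass
class HeadingComponent:
    levels: List[HeadingLevel]


@dataclass
class Report:
    global_filter: Dict[str, str]
    top_heading: List[HeadingComponent]
    side_heading: List[HeadingComponent]


def drills_reference_only_attributes(report: Report, fact_names: Set[str]) -> bool:
    """Check that no drill step names a fact, so the fact structure stays global."""
    components = report.top_heading + report.side_heading
    return all(drill not in fact_names
               for component in components
               for level in component.levels
               for drill in level.drills)


# Hypothetical fragment of a report with a drill to format and frequency.
report = Report(
    global_filter={"Productgroup": "CTV", "Country": "Austria"},
    top_heading=[HeadingComponent([HeadingLevel("Sales", is_fact=True)])],
    side_heading=[HeadingComponent([
        HeadingLevel("ScreenSize",
                     attribute_filter=AttributeFilter(["27 INCHES"]),
                     drills=["Format", "Frequency"],
                     others=True),
    ])],
)
```

Storing the drills as a flat list rather than as nested objects mirrors the sequential description mentioned in the text: one list describes the drill split for all selected instances at once.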




Fig. 5. Report object model for heterogeneous reports 



In Figure 6, some core parts of the report object model describing the heterogeneous
report shown in Figure 4 are depicted. The global filter is set to Productgroup="CTV"
and Country="Austria". The top heading is decomposed into two separate heading
components, of which the first one is specified down to the leaf nodes. For the side
heading, only the first report component containing different drills is shown in some
detail. Note that the modeling of the double drill to format and frequency for the
"27 INCHES" drill anchor must be executed as a cross-product operation on an
implementation level.

Fig. 6. Object model of the sample report

The report object model introduced in this section is complete in the sense that it
covers all the situations relevant in real-world complex report applications. The only
restriction the model imposes is that some compatibility between fact structures must
be guaranteed across a report. The model strictly separates structural, attribute and
instance information. On this basis, the model can be mapped to report object models
found in state-of-the-art data warehouse and OLAP platforms, thus limiting the
implementation work to the necessary minimum. Before describing a
concrete implementation of the model on top of MicroStrategy’s DSS Objects system,
the basic concepts of the Graphical User Interface design will be described in the next
section.



5 GUI Design Principles for Complex Reporting Applications

Finding an adequate internal representation of the complex report structures needed
for real-world data warehouse and OLAP applications is only one side of the coin.
Equally important is to find the right usage metaphor that allows a user to easily 
understand the system and thus to exploit its full potential. 

It was argued earlier in this paper that a WYSIWYG-like usage concept is
considered the best choice for a complex reporting application. The advantages of
such an approach are two-fold:

• during the report design phase, the user is guided step by step by immediately
seeing the results of each operation;

• for the final report, it is guaranteed that the information shown on the report is self-
contained, i.e. no background information is necessary to interpret the report.

In addition, it is desirable that the system dynamically adapts the range of possible
selections for every report construction step to the context the user is creating within
the report object. This is particularly important in the presence of dependent 
secondary classifications, as is the case, for example, for the product dimension in a 
market research application (cf. Section 1). 

The general GUI layout of the GfK reporting system is depicted in Figure 7. It 
consists of four major areas: 

1. toolbar area for access to administrative functions,

2. component selection area, representing objects in a treeview structure, 

3. global report filter area and 

4. report instance area, subdivided into top heading, side heading and data area 

Fig. 7. Screen layout of the GfK reporting system



The usage paradigm of the system is that the user constructs the report by 
successively dragging and dropping attributes from the component selection area into 
the global report filter and report instance area. The work style with the system is 
exemplified for a drill operation in Figure 8. 

Fig. 8. Specification of a drill operation



In the first step, the Month attribute is selected from the components list and dragged 
into the report instance area. When the element is dropped onto the 1997 instance of 
the previously included Year attribute, a selection list opens which allows for a sub-
selection of instances to be shown in the report, and for the specification of a
particular sort order in the final report. After closing the selection list, the report 
object is updated to reflect the changes the operation implies to the report design. 



6 Implementation

The complex object model described in this paper forms the core of the new GfK 
reporting system. This system is implemented on top of MicroStrategy’s DSS Suite, 
in particular the DSS Objects system ([11]). DSS Objects offers an API (Application 
Programming Interface) to access the various functions needed to implement a 
ROLAP system. Application code is written in VisualBasic, as the computing-intensive
parts of the application run on the underlying Oracle database system.

The report model of MicroStrategy’s DSS Suite consists of three basic concepts: 

• templates, which describe the layout structure of a report as a combination of 
dimensional attributes, along with information on the facts to be shown, 

• filters, which describe the data that go into a report as a collection of attribute 
values from the different dimensions, and 

• reports, which are combinations of a template and a filter object. 

The strict distinction between attribute and attribute value references in DSS’s
template and filter structures, respectively, enables a high degree of re-use for those
components, as templates and filters may be combined in an arbitrary manner. However,
this concept is only applicable when the dimensional descriptions are the same for all
elements of a dimension. In Section 1, it was shown that this assumption is not valid
for many real-world applications. In our market research example, the feature
attributes are valid only within a specific context, e.g. a product group. In those
situations, combining a template containing references to attributes like
"AudioSystem" and "VideoSystem" with a filter selecting the dishwasher product
group, for example, would result in an empty report.
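
The problem of context-dependent attributes can be sketched independently of any product API. The classes `Template` and `Filter` below are simplified stand-ins invented for this illustration and are not the DSS Objects classes; the validity table is likewise a hypothetical example.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical validity table: which feature attributes exist for which product group.
VALID_ATTRIBUTES = {
    "Camcorder": {"Brand", "AudioSystem", "VideoSystem"},
    "Dishwasher": {"Brand"},
}


@dataclass(frozen=True)
class Template:
    attributes: Tuple[str, ...]   # attribute references that define the layout


@dataclass(frozen=True)
class Filter:
    product_group: str            # attribute value selecting the data


def is_meaningful(template: Template, flt: Filter) -> bool:
    """A combination is meaningful only if every attribute is valid in the filter context."""
    return set(template.attributes) <= VALID_ATTRIBUTES[flt.product_group]


def offered_attributes(flt: Filter):
    """Attributes a design tool could offer once the filter context is known."""
    return sorted(VALID_ATTRIBUTES[flt.product_group])
```

A check of this kind at design time prevents the combination from being built at all, instead of letting it fail silently as an empty report at execution time.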

To overcome the problems mentioned above, two steps are necessary: 

• during report design, ensure that only those attributes that are valid within the
already specified context are offered to the user, and

• for report execution, decompose the complex report structure defined on
application level into basic report objects that can be handled by the underlying
data warehouse / OLAP tool, and re-combine the partial results into a single report
object at user level.

We will only elaborate on the latter issue here; the former issue is a modeling issue
discussed elsewhere in more detail ([12]).

In Figure 9, the interaction between an application-level complex report object and
diverse tool-level basic reports is exemplified for a very simple case. The
transformation of the complex report object is technically performed by decomposing
the internal object representation of the complex report into a semantically equivalent
set of DSS reports, which are executed in the DSS Objects environment. To do that,
DSS Objects transforms the API call for a report execution into an SQL statement,
which is then transmitted to the underlying database server and executed there. The
results are handed back to the VisualBasic application as a two-dimensional array
structure with heading information for every data cell. From the different data arrays
containing the results of the different DSS Objects calls, the application picks the data
needed cell by cell.
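
The decompose-and-recombine idea can be sketched without reference to the actual tool. In the following illustration, `run_basic_report` is a hypothetical stand-in for the execution of one tool-level report (it aggregates an in-memory fact list instead of issuing SQL), and the data are invented. The complex report shows yearly totals with a drill to months for the 1997 anchor only.

```python
# Invented fact rows: (year, month, sales).
FACTS = [
    (1996, "Jan", 10), (1996, "Feb", 20),
    (1997, "Jan", 15), (1997, "Feb", 25),
]


def run_basic_report(group_by):
    """Stand-in for one tool-level report: aggregate sales by a grouping function."""
    result = {}
    for year, month, sales in FACTS:
        key = group_by(year, month)
        if key is not None:          # None means the row is filtered out
            result[key] = result.get(key, 0) + sales
    return result


def complex_report():
    """Combine two basic reports into one heterogeneous report, cell by cell."""
    by_year = run_basic_report(lambda y, m: (y,))
    months_1997 = run_basic_report(lambda y, m: (y, m) if y == 1997 else None)
    rows = []
    for (year,), total in sorted(by_year.items()):
        rows.append((str(year), total))
        if year == 1997:             # the drill anchor keeps its own row
            for (_, month), value in months_1997.items():
                rows.append(("  " + month, value))
    return rows
```

Each basic report is structurally homogeneous, so it maps to a single query; the heterogeneity exists only in the recombination step at application level.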

It should be intuitively clear that for real report objects needed in applications like
market research, the number of tool-level reports needed to generate a single
application-level report may become enormous. However, the tool-level reports are
then usually quite small, i.e. they touch only a relatively small number of instances.
Together with an intelligent data aggregation mechanism, the needed system
performance is still achievable ([9]).

Fig. 9. Interaction between application level complex report objects and tool level basic reports



7 Conclusion

Complex reports needed in real-world reporting applications are not sufficiently 
supported by state-of-the-art data warehouse and OLAP platforms. The usage 
principles found in those systems are too computer-centric and do not meet the 
requirements of many of the typical users of such systems, in particular, the user
demand for compact information presentation. In this paper, we have shown how 
existing tools can be extended by a complex report management layer built on top of 
the platform API. 

The current implementation of the ideas presented in this paper within the GfK data 
warehouse project is close to completion. Test users appreciate the easy-to-use user 
interface of the system, in particular the context-driven user guidance through the 
feature-extended multi-dimensional data cubes. On the performance side, we did not 
go to the full limits of the application so far. However, should we experience serious 
performance problems, there are a number of performance tuning opportunities 
intrinsic to our approach besides the usual, system-oriented measures like partitioning 
and indexing. As complex reports are decomposed into relatively small and 
structurally simple reports, chances are high that those low-level reports may be re-
used over and over again for different application-level reports. We are currently
integrating into our architecture the results of some research towards a self-adapting 
aggregation management system. 

Abstract. Profiling customers’ behavior has become increasingly important for
many applications such as fraud detection, targeted marketing and promotion. 
Customer behavior profiles are created from very large collections of transac- 
tion data. This has motivated us to develop a data-warehouse and OLAP based, 
scalable and flexible profiling engine. We define profiles by probability distri- 
butions, and compute them using OLAP operations on multidimensional and 
multilevel data cubes. Our experience has revealed the simplicity and power of 
OLAP-based solutions to scalable profiling and pattern analysis. 



1 Introduction 

Profiling customers’ behavior aims at extracting patterns of their activities from trans- 
actional data, and using these patterns to provide guidelines for service provisioning, 
trend analysis, abnormal behavior discovery, etc. It has become increasingly important 
in a variety of application domains, such as fraud detection, personalized marketing 
and commercial promotion. It has also given rise to the need for a scalable infra- 
structure to support filtering, mining and analyzing massive transaction data continu- 
ously [1],[2],[4]. We have developed such an infrastructure with data-warehousing and 
OLAP technology. 

In this paper we will focus on the construction and application of customer behav- 
ior profiles from telephone call data for the purpose of fraud detection. Typically, a 
customer’s calling behavior is represented by the composition and periodic appearance 
of his call destination, time-window and duration. One way of doing fraud detection is 
to discover abnormal calling behavior, which may be further classified into the fol- 
lowing two categories. 

Threshold based fraud detection. For example 

□ a call is suspicious if its duration > 24 hours, 

□ a call is suspicious if its duration > 4 hours and it is made in the evening. 

Pattern based fraud detection. For example, 

□ a caller (identified by phone number) is suspicious if his calling pattern is similar to 
a previously known fraudulent one. 

Profiling callers’ behavior is significant for both kinds of fraud detection. In threshold 
based fraud detection, without information about personalized calling behavior, only 
generalized thresholds may be set, such as to consider a call to be suspicious if it lasts 
over 24 hours. With the availability of a customer’s calling behavior profile, person-
alized, rather than generalized thresholds can be set, so that

□ calls by John for 4 hours are considered usual, but 

□ calls by Jane for 2 hours are considered unusual. 

Thus, personalized or group-based thresholds can be used to provide more precise 
fraud detection than generalized thresholds. Similarly, pattern based fraud detection is 
based on profiles to do pattern matching. Each new customer’s calling behavior is 
profiled and compared against known fraudulent profiles. Customer profiles are also 
useful for other summary information oriented applications. 
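
The difference between generalized and personalized thresholds can be shown in a few lines. This is a minimal sketch: the per-customer limits below are invented values standing in for limits that would be derived from behavior profiles, and the function name `is_suspicious` is an assumption.

```python
GENERAL_LIMIT_HOURS = 24.0

# Hypothetical per-customer limits, standing in for values derived from profiles.
PERSONAL_LIMIT_HOURS = {"John": 5.0, "Jane": 1.5}


def is_suspicious(caller, duration_hours):
    """Use the caller's personalized limit if one exists, else the general one."""
    limit = PERSONAL_LIMIT_HOURS.get(caller, GENERAL_LIMIT_HOURS)
    return duration_hours > limit
```

With these assumed limits, a 4-hour call by John is considered usual, a 2-hour call by Jane is flagged, and a caller without a profile is only flagged by the general 24-hour rule.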

To create and update customer behavior profiles, hundreds of millions of call rec-
ords must be processed every day. This has motivated us to develop a scalable and
maintainable framework to support such profiling. The profiling engine is built on top 
of an Oracle-8 based telecommunication data-warehouse and Oracle Express, a multi- 
dimensional OLAP server. Profiles and calling patterns are represented as multidi- 
mensional cubes and based on the probability distribution of call volumes. The pro- 
filing engine is capable of building and updating customer calling behavior profiles 
incrementally by mining call records that flow into the data-warehouse daily, deriving 
calling patterns from profiles, analyzing and comparing the similarity of calling pat- 
terns. We have demonstrated the practical value of using an OLAP server as a scalable 
computation engine to support profile computation, maintenance and utilization. 

We share the same view as described in [1] and [5], in taking advantage of OLAP 
technology for analyzing data maintained in data-warehouses. Particularly, we are in- 
line with the efforts described in [5] to use OLAP tools to support large-scale data 
mining. However, to our knowledge, there is no prior work reported on OLAP based 
customer behavior profiling and pattern analysis. 

Section 2 introduces the concept of behavior profiling with probability distribu- 
tions. Section 3 describes the architecture of our profiling engine. Section 4 illustrates 
how to compute profile cubes, and analyze and compare calling pattern cubes. Finally,
in section 5 some conclusions are given.



2 Probability Distribution based Profiling with OLAP 

Lor customer behavior profiling, we first have to decide which features (dimensions) 
are relevant. Lor our calling behavior profiling application, the features of interest are 
the phone-numbers, volume (the number of calls), duration, time of day, and day of 
week for a customer’s outgoing and incoming calls. Next, we have to select the 
granularity of each feature. Thus, the time of day feature may be represented by the 
time-bins ‘morning’, ‘afternoon’, ‘evening’ or ‘night’; the duration feature may be 
represented by ‘short’ (shorter than 20 minutes), ‘medium’ (20 to 60 minutes), or 
‘long’ (longer than 60 minutes). Finally, we have to decide the profiling interval (e.g.
3 months) over which the customer profiles will be constructed, and the periodicity of
the profiles (e.g. weekly). Thus, in our application, a customer’s profile is a weekly 
summarization of his calling behavior during the profiling interval.

Based on the profiled information, calling patterns of individual customers may be
derived. Conceptually we can consider the following three kinds of calling patterns. 

A fixed-value based calling pattern represents a customer’s calling behavior with
fixed values showing his “average” behavior. For example, a calling pattern from 
number A to number B says that on the average calls are short in afternoons and long 
in evenings. 

A volume based calling pattern summarizes a customer’s calling behavior by 
counting the number of calls of different duration in different time-bins. For example, 
a calling pattern from number A to number B says that there were 350 short calls in 
the mornings of the profiling period, etc. 

A probability distribution based calling pattern represents a customer’s calling be- 
havior with probability distributions. For example, a calling pattern from number A to 
number B says that 10% of the calls in the morning were long, 20% were medium, 
70% were short. 

Probability distribution based calling patterns provide more fine-grained represen- 
tation of dynamic behavior than fixed value based ones. They also allow calling pat- 
terns corresponding to different lengths of profiling interval to be compared. 

We represent profiles and calling patterns as cubes. A cube has a set of underlying 
dimensions, and each cell of the cube is identified by one value from each of these 
dimensions. The set of values of a dimension D, called the domain of D, may be lim- 
ited (by the OLAP limit operation) to a subset. A sub-cube (slice or dice) can be de- 
rived from a cube C by dimensioning C by a subset of its dimensions, and/or by lim- 
iting the value sets of these dimensions. 

As mentioned above, the profile of a customer is a weekly summarization of his ac- 
tivities in the profiling period. For efficiency in our prototype system we group the 
information for profiling multiple customers’ calling behavior into a single profile 
cube with dimensions <duration, time, dow, callee, caller>, where dow stands for 
day_of_week (e.g. Monday, ..., Sunday), callee and caller are called and calling phone
numbers. The value of a cell in a profiling cube measures the volume, i.e. number of 
calls, made in the corresponding duration-bin, time-bin in a day, and day of week, 
during the profiling period. In this way a profile cube records multiple customers’ 
outgoing and incoming calls week by week. From such a multi-customer profile cube, 
calling pattern cubes of individual customers may be derived. They have similar di- 
mensions as the profile cubes except that a calling pattern cube for outgoing calls is 
not dimensioned by caller, and a calling pattern cube for incoming calls is not dimen- 
sioned by callee, because they pertain to a single customer. 

Multiple calling pattern cubes may be generated to represent a customer’s calling 
behavior from different aspects. In our design, several calling pattern cubes repre- 
senting probability-based information are actually derived from intermediate calling 
pattern cubes representing volume-based information. 

Let us consider a volume-based cube V for a single customer derived from the 
above profile cube by totaling outgoing calls over days of week. V holds the counts of 
calls during the profiling period dimensioned by <time, duration, callee>, where di- 
mension time has values ‘morning’, ‘evening’, etc; duration has values ‘short’, ‘long’, 
etc; dimension callee contains the called phone numbers. A cell in the cube is identified
by one value from each of these dimensions. From cube V the following different
probability cubes (and others) may be generated: 

□ C_pri for the prior probability of time-bin of calls wrt each callee, that is dimensioned
by <time, callee>, and indicates the percentage of calls made in ‘morning’, ‘afternoon’,
‘evening’ and ‘night’ respectively.

□ C_p for the conditional probability of call duration-bin given time-bin of calls wrt
each callee, that is dimensioned by <time, duration, callee>, and indicates the percentage
of calls that are ‘long’, ‘medium’ and ‘short’ respectively, given the time-bin.

□ C_con for the probabilistic consequence of the above, i.e. the probability of calls in
every cell crossing dimensioned by <time, duration, callee> over the total calls.

All the above probability cubes, C_pri, C_p and C_con, can be derived from cube V using
OLAP operations. In the Oracle Express OLAP language, these are expressed as

□ C_pri = total(V, time, callee) / total(V, callee)

□ C_p = (V / C_pri) / total(V, callee)

□ C_con = V / total(V, callee)

In the above expressions, total is a typical OLAP operation on cubes with numerical 
cell values. While total(V) returns the total of the cell values of V, total(V, callee) 
returns such a total dimensioned by callee, total(V, time, callee) returns such a total 
dimensioned by time and callee. In fact a dimensioned total represents a cube. The 
arithmetic operations on cubes, such as ‘/’ used above, are computed cell-wise. 

With the above mechanism it is only necessary to make volume cubes persistent
data-warehouse objects. In other words, only the volume based information needs
to be profiled; calling patterns, either based on volume or probability, can be derived.
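
The derivation of the three probability cubes from a volume cube can be illustrated outside Oracle Express. The following is a minimal Python sketch, assuming a sparse cube stored as a dictionary keyed by (time, duration, callee); the helper `total`, the phone numbers and the call counts are hypothetical and only mimic the dimensioned total described above.

```python
from collections import defaultdict

# Volume cube V for one customer: (time, duration, callee) -> number of calls.
# The counts and phone numbers are made-up illustration data.
V = {
    ("morning", "short", "555-0001"): 7,
    ("morning", "medium", "555-0001"): 2,
    ("morning", "long", "555-0001"): 1,
    ("evening", "long", "555-0001"): 10,
    ("morning", "short", "555-0002"): 5,
}

TIME, DURATION, CALLEE = 0, 1, 2


def total(cube, keep):
    """Sum cell values, keeping only the dimensions whose positions are in `keep`."""
    out = defaultdict(int)
    for cell, value in cube.items():
        out[tuple(cell[i] for i in keep)] += value
    return dict(out)


per_time_callee = total(V, (TIME, CALLEE))  # plays the role of total(V, time, callee)
per_callee = total(V, (CALLEE,))            # plays the role of total(V, callee)

# Prior probability of the time-bin for each callee.
C_pri = {(t, c): n / per_callee[(c,)] for (t, c), n in per_time_callee.items()}

# Probability of each cell over the total calls to the callee: V / total(V, callee).
C_con = {(t, d, c): n / per_callee[(c,)] for (t, d, c), n in V.items()}

# Conditional probability of the duration-bin given the time-bin:
# (V / C_pri) / total(V, callee), computed cell-wise.
C_p = {
    (t, d, c): (n / C_pri[(t, c)]) / per_callee[(c,)]
    for (t, d, c), n in V.items()
}
```

With this data, 10 of the 20 calls to the first callee fall in the morning, so the prior for that time-bin is 0.5, and 7 of those 10 morning calls are short, so the conditional probability of a short call in the morning is 0.7.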



3 Architecture of the Profiling Engine 

The profiling engine provides the following major functions. 

□ Building and incrementally updating customer calling behavior profiles by mining 
call records flowing into the data-warehouse daily, using an OLAP server. 

□ Maintaining profiles by staging data between the data-warehouse and the OLAP 
multidimensional database. 

□ Deriving multilevel and multidimensional customer calling patterns from profiles 
for analysis. 

□ Comparing the similarity of customer calling patterns from volume and probability 
distribution points of view, and generating multilevel and multidimensional simi- 
larity measures, to be used in such applications as fraud detection. 

The profiling engine is built on top of an Oracle 8 based data-warehouse and Oracle
Express, an OLAP server (Figure 1). Call data records, customer behavior profiles and
other reference data are stored in the warehouse. Call data records are fed in daily and 
dumped to archive after use [3]. The OLAP server is used as a computation engine for 
creating and updating profiles, deriving calling patterns from profiles, as well as ana- 
lyzing and comparing calling patterns. The following process is repeated periodically 
(e.g. daily).

□ Call data records are loaded into call data tables in the data-warehouse, and then
loaded to the OLAP server to generate a profile-snapshot cube that is multi- 
customer oriented. 

□ In parallel with the above step, a profile cube covering the same set of customers is 
retrieved from the data-warehouse. 

□ The profile cube is updated by merging it with the profile-snapshot cube.

□ The updated profile cube is stored back to profile tables in the data-warehouse. The 
frequency of data exchange between the data-warehouse and the OLAP server is 
controlled by certain data staging policies. 



Fig. 1. Data warehouse and OLAP server based profiling engine architecture.

In order to reduce data redundancy and query cost, we chose to maintain minimal 
data in the profile tables in the data-warehouse. We include multiple customers’ call- 
ing information in a single profile table or profile cube, without separating information 
on outgoing calls and incoming calls. We make the relational schema of the profile 
table directly correspond to the base level of the profile cube. Derivable values at 
higher levels are not maintained in the data-warehouse. 

The OLAP engine actually serves as a scalable computation engine for generating 
profile cubes, deriving calling pattern cubes, analyzing individual calling patterns in 
multiple dimensions and at multiple levels, and comparing pattern similarity. From a 
performance point of view, it supports indexed caching, reduces database access dra- 
matically and extends main memory based reasoning. From a functionality point of 
view, it allows us to deliver powerful solutions for profiling, pattern generation, analy- 
sis and comparison, in a simple and flexible way. 



4 Profile Cubes and Calling Pattern Cubes 

We deal with two general kinds of cubes: multi-customer based profile cubes and 
single customer based calling pattern cubes.



4.1 Profile Cubes

A profile cube, say PC, and a profile-snapshot cube, say PCS, have the same underly- 
ing dimensions, and contain profiling information of multiple customers in direct 
correspondence with the relational tables in the data-warehouse. In the Oracle Express 
language they are defined as 

define PC variable int <sparse <duration time dow callee caller>> inplace

define PCS variable int <sparse <duration time dow callee caller>> inplace

where callee and caller are called and calling numbers; dimension time has values
‘morning’, ‘evening’, etc, for time-bins; dimension duration has values representing 
duration-bins (e.g. ‘short’); and dimension dow has values representing days of
week (e.g. ‘MON’). The use of keyword “sparse” in the above definitions instructs
Oracle Express to create a composite dimension <duration time dow callee caller>, in 
order to handle sparseness, particularly between calling and called numbers, in an 
efficient way. 

Profile-snapshot cube PCS is populated by means of binning. A call data record 
contains fields with values mapping to each dimension of the PCS cube. Such mapping
is referred to as binning. For example, ‘8am’ is mapped to time-bin ‘morning’, and 5
minutes is mapped to duration-bin ‘short’. A call made at 8am and lasting 5 minutes
falls into the cell corresponding to time = ‘morning’ and duration = ‘short’. 

Profile cube PC is retrieved from the database and updated by merging PCS, and 
then stored back to database. In Oracle Express, the merge of PC and PCS is simply 
expressed as 

PC = PC + PCS

In this way customer profiles are updated incrementally as each new batch of call data
records flows into the data-warehouse.
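
Binning and the cell-wise merge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (caller, callee, day of week, hour, minutes), the hour boundaries of the time-bins and the sample counts are hypothetical, while the duration thresholds follow Section 2. A `Counter` stands in for a sparse cube.

```python
from collections import Counter


def time_bin(hour):
    """Map an hour of day (0-23) to a time-bin; the boundaries are assumed."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def duration_bin(minutes):
    """Map a call duration in minutes to a duration-bin (thresholds of Section 2)."""
    if minutes < 20:
        return "short"
    if minutes <= 60:
        return "medium"
    return "long"


def snapshot(records):
    """Build a profile-snapshot cube from (caller, callee, dow, hour, minutes) records."""
    pcs = Counter()
    for caller, callee, dow, hour, minutes in records:
        pcs[(duration_bin(minutes), time_bin(hour), dow, callee, caller)] += 1
    return pcs


# One day's call data records (made-up data).
records = [
    ("A", "B", "MON", 8, 5),
    ("A", "B", "MON", 9, 12),
    ("A", "C", "SAT", 20, 75),
]
PCS = snapshot(records)

# Profile cube retrieved from the warehouse (made-up starting volume).
PC = Counter({("short", "morning", "MON", "B", "A"): 40})

# Cell-wise merge, the counterpart of PC = PC + PCS.
PC = PC + PCS
```

Because the merge only adds counts cell by cell, the snapshot can be discarded once it has been folded into the profile cube.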



4.2 Hierarchical Dimensions for Multilevel Pattern Representation 

Calling pattern cubes are derived from profile cubes and used to represent the calling 
behavior of individual customers. In order to represent such calling behavior at multiple
levels, dimensions dow, time and duration are defined as hierarchical dimensions,
along which the calling pattern cubes can be rolled up. 

A hierarchical dimension D contains values at different levels of abstraction. Asso- 
ciated with D there are a dimension DL describing the levels of D, a relation DL_D 
mapping each value of D to the appropriate level, and a relation D_D mapping each 
value of D to its parent value (the value at the immediate upper level). Let D be an 
underlying dimension of a numerical cube C such as a volume-based calling pattern 
cube. D, together with DL, DL_D and D_D, fully specify a dimension hierarchy. They 
provide sufficient information to rollup cube C along dimension D, that is, to calculate 
the total of cube data at the upper levels using the corresponding lower-level data. A 
cube may be rolled up along multiple underlying dimensions. For example, the dow 
hierarchy is made of the following objects. 







□ dow (day of week): dimension with values 'MON', ..., 'SUN' at the lowest level ('dd'
level), 'wkday', 'wkend' at a higher level ('ww' level), and 'week' at the top level
('week' level).

□ dowLevel: dimension with values 'dd', 'ww', 'week'

□ dow_dow: relation (dow, dow) for mapping each value to its parent value, e.g.

dow_dow(dow 'MON') = 'wkday'
dow_dow(dow 'SAT') = 'wkend'
dow_dow(dow 'wkday') = 'week'
dow_dow(dow 'wkend') = 'week'
dow_dow(dow 'week') = NA

□ dowLevel_dow: relation (dow, dowLevel) for mapping each value to its level, e.g.

dowLevel_dow(dow 'MON') = 'dd'
dowLevel_dow(dow 'wkday') = 'ww'
dowLevel_dow(dow 'wkend') = 'ww'
dowLevel_dow(dow 'week') = 'week'

Analogously, the time hierarchy is made up of dimension time; dimension timeLevel
with values 'day', 'month', 'year' and 'top'; parent relation time_time and level
relation timeLevel_time. The duration hierarchy is made up of dimension duration;
dimension durLevel with values 'dur_bin' and 'dur_all'; parent relation dur_dur and
level relation durLevel_dur.

For profile storage, combination and updating, only the bottom levels are involved,
therefore rolling up profile cubes such as PC is unnecessary. Rolling up is only appli- 
cable to calling pattern cubes for analysis purposes. 
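
How a parent relation such as dow_dow drives a roll-up can be sketched as follows. This is a minimal Python illustration, not the Oracle Express mechanism itself: the function `rollup`, the two-dimensional cube and its counts are hypothetical, and the parent entries for the days other than 'MON' and 'SAT' are filled in by analogy with the examples above. The input cube is assumed to hold bottom-level cells only.

```python
# Parent relation for the dow hierarchy; None plays the role of NA.
DOW_PARENT = {
    "MON": "wkday", "TUE": "wkday", "WED": "wkday", "THU": "wkday", "FRI": "wkday",
    "SAT": "wkend", "SUN": "wkend",
    "wkday": "week", "wkend": "week",
    "week": None,
}


def rollup(cube, dim, parent):
    """Add upper-level totals along dimension position `dim`.

    `cube` maps cell tuples to counts and must contain bottom-level cells only.
    Each cell value is added to every ancestor of its value on that dimension.
    """
    rolled = dict(cube)
    for cell, value in cube.items():
        ancestor = parent[cell[dim]]
        while ancestor is not None:
            key = cell[:dim] + (ancestor,) + cell[dim + 1:]
            rolled[key] = rolled.get(key, 0) + value
            ancestor = parent[ancestor]
    return rolled


# A tiny volume cube dimensioned by (duration, dow); the counts are made up.
cube = {
    ("short", "MON"): 3,
    ("short", "TUE"): 2,
    ("short", "SAT"): 4,
    ("long", "SUN"): 1,
}
rolled = rollup(cube, 1, DOW_PARENT)
```

Rolling up along several dimensions amounts to applying the same step once per hierarchical dimension, each time to the bottom-level cells of that dimension.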



4.3 Calling Pattern Cubes 

A calling pattern cube is associated with a single customer. As the calling behavior of 
a customer may be viewed from different aspects, different kinds of calling pattern 
cubes may be defined. These cubes are commonly dimensioned by time, duration and 
dow (day of week), and in addition, for those related to outgoing calls, dimensioned by 
callee, and for those related to incoming calls, dimensioned by caller. Their cell val- 
ues represent the number of calls, the probability distributions, etc. Calling pattern 
cubes are derived from profile cubes, say, PC, and then may be rolled up. 

Volume based calling patterns. Cube CB.o represents the outgoing calling behavior 
of a customer. In Oracle Express that is defined by 

define CB.o variable int <sparse <duration time dow callee>> inplace

Similarly, cube CB.d representing incoming calling behavior is defined by

define CB.d variable int <sparse <duration time dow caller>> inplace

The cell values of these cubes are the number of calls falling into the given ‘slot’ of
time, duration, day of week, etc. When generated, CB.o and CB.d are rolled up along 
dimensions duration, time and dow. Therefore,

CB.o(duration 'short', time 'morning', dow 'MON')

measures the number of short-duration calls this customer made to each callee (dimensioned
by callee) on Monday mornings during the profiling interval. Similarly,

CB.o(duration 'all', time 'allday', dow 'week')

measures the number of calls this customer made to each callee (total calls dimensioned
by callee) during the profiling interval.

Cubes representing probability distribution based calling patterns are derived from 
volume-based pattern cubes. Depending on the application requirements various cubes 
may be derived. We list below two kinds of calling pattern cubes for outgoing calls. 
Calling pattern cubes for incoming calls can be defined similarly. 

Probability distribution on all calls. Cube P_CB.o for a customer represents the 
dimensioned probability distribution of outgoing calls over all the outgoing calls made 
by this customer, and is derived from CB.o in the following way 

define P_CB.o formula decimal <duration time dow callee>

EQ (CB.o / total(CB.o(duration 'all', time 'allday', dow 'week')))

where total(CB.o(duration 'all', time 'allday', dow 'week')) is the total number of calls this
customer made to all callees (remember that CB.o has already been rolled up, hence 
we can use its top-level value). The value of a cell is the above probability corre- 
sponding to the underlying dimension values. 

Probability distribution on calls to each callee. Cube P1_CB.o is dimensioned by
duration, ... and callee, and represents the probability distribution of a customer’s
outgoing calls over his total calls to the corresponding callee, and is also derived from
CB.o as specified in the following

define P1_CB.o formula decimal <duration time dow callee>

EQ (CB.o / total(CB.o(duration 'all', time 'allday', dow 'week'), callee))

where total(CB.o(duration 'all', time 'allday', dow 'week'), callee) is the total number of
calls this customer made to each callee (dimensioned by callee). The value of a cell is
the above probability corresponding to the underlying dimension values.
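
The difference between the two normalizations can be seen on a small example. The following Python sketch assumes a rolled-up volume cube stored as a dictionary keyed by (duration, time, dow, callee); the names `CB_o`, `P_CB_o` and `P1_CB_o` merely echo the cubes above, and the cells and counts are made up.

```python
# Rolled-up volume cube for one customer: (duration, time, dow, callee) -> calls.
# Only a few cells are shown; the counts are made-up illustration data.
CB_o = {
    ("short", "morning", "MON", "B"): 6,
    ("long", "evening", "SAT", "B"): 2,
    ("short", "morning", "MON", "C"): 2,
    # top-level cells produced by rolling up duration, time and dow
    ("all", "allday", "week", "B"): 8,
    ("all", "allday", "week", "C"): 2,
}

TOP = ("all", "allday", "week")

# Total calls per callee, read from the top-level cells of the rolled-up cube.
calls_per_callee = {cell[3]: n for cell, n in CB_o.items() if cell[:3] == TOP}

# Total calls to all callees.
all_calls = sum(calls_per_callee.values())

# Distribution over all outgoing calls of the customer.
P_CB_o = {cell: n / all_calls for cell, n in CB_o.items()}

# Distribution over the calls made to the corresponding callee.
P1_CB_o = {cell: n / calls_per_callee[cell[3]] for cell, n in CB_o.items()}
```

Here short Monday-morning calls to B account for 6 of the customer's 10 calls overall, but for 6 of the 8 calls made to B.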



4.4 Calling Pattern Similarity Comparison 

Calling pattern comparison is important for such applications as fraud detection. Since 
the similarity of customer behavior can be represented from different angles, we com- 
pare calling patterns derived from customer calling behavior profiles, rather than 
comparing profiles directly. For example, some calling patterns might be similar in the 
volume of calls to the same set of callees, others might be similar in the time of these 
calls such as late nights. Our objective, therefore, is to enable the comparison of call- 
ing patterns along multiple dimensions and at multiple levels of the dimension hierar- 
chies. 

Given two input calling pattern cubes, say C_1 and C_2, the output of the comparison
is a similarity cube, say C_s, rather than a single value. The similarity cube C_s can be
dimensioned differently from cubes C_1 and C_2 being compared. Each cell of C_s represents
the similarity of a pair of corresponding sub-cubes (slices or dices) of C_1 and C_2.

To support such cube similarity comparison, the following should be provided.

□ The mapping from a cell of C_s to a pair of corresponding sub-cubes of C_1 and C_2.

□ The algebraic structure for summarizing cell-wise comparison results of a pair of
sub-cubes to a single similarity measure to be stored in one cell of C_s.

For the latter, we have introduced the following two approaches. One treats a sub-cube
as a bag, and summarizes cell-wise comparison results based on bag overlap.
The other treats a sub-cube as a vector, and summarizes cell-wise comparison results
based on vector distance.
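
The exact similarity measures are not given here, so the following Python sketch only illustrates what the two kinds of summarization could look like. Both formulas are assumptions made for illustration: the bag overlap is taken as the ratio of shared to combined volume, and the vector similarity as one minus a normalized L1 distance. The functions `bag_overlap` and `vector_similarity` and the sample sub-cubes are hypothetical.

```python
def bag_overlap(a, b):
    """Assumed bag-overlap measure: sum of cell-wise minima over sum of maxima."""
    cells = set(a) | set(b)
    shared = sum(min(a.get(c, 0), b.get(c, 0)) for c in cells)
    combined = sum(max(a.get(c, 0), b.get(c, 0)) for c in cells)
    return shared / combined if combined else 1.0


def vector_similarity(a, b):
    """Assumed vector measure: 1 - L1 distance divided by the sum of the L1 norms."""
    cells = set(a) | set(b)
    distance = sum(abs(a.get(c, 0) - b.get(c, 0)) for c in cells)
    norm = sum(abs(v) for v in a.values()) + sum(abs(v) for v in b.values())
    return 1.0 - distance / norm if norm else 1.0


# Two small sub-cubes keyed by cell; missing cells count as zero (made-up data).
sub_1 = {"x": 3, "y": 1}
sub_2 = {"x": 1, "y": 1, "z": 2}

overlap = bag_overlap(sub_1, sub_2)
similarity = vector_similarity(sub_1, sub_2)
```

Both functions return 1.0 for identical sub-cubes and 0.0 for sub-cubes with no cell in common, which is the behavior one would expect of any measure stored in a similarity cube.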

The bag-overlap based approach is primarily used for comparing volume-based cubes,
while the vector-distance based approach can be used for comparing both volume-based
and probability-based cubes. The similarity of volume-based calling patterns is mean- 
ingful only when they cover the same time-span. This limitation can be eliminated in 
measuring the similarity of probability-based calling patterns. This is especially useful 
in comparing a preset calling pattern with an ongoing one in real-time. For example, 
the following cube measures the similarity of probability-based outgoing calling pat- 
terns 

define P1SIM.o variable decimal <durLevel, timeLevel, dowLevel> inplace

An instance of P1SIM.o is illustrated in Figure 2.

DOWLEVEL: week
P1SIM.o      DURLEVEL
TIMELEVEL    dur_all   dur_bin
time_all     1.00      0.71
time_bin     0.94      0.70

DOWLEVEL: ww
P1SIM.o      DURLEVEL
TIMELEVEL    dur_all   dur_bin
time_all     0.95      0.72
time_bin     0.92      0.71

DOWLEVEL: dd
P1SIM.o      DURLEVEL
TIMELEVEL    dur_all   dur_bin
time_all     0.77      0.63
time_bin     0.73      0.61

Fig. 2. P1SIM.o: Multilevel and multidimensional similarity cube

P1SIM.o is calculated by comparing two probability-based calling pattern cubes
based on vector-distance using a cell-to-subcube mapping. It takes two calling pattern
cubes, P1_CB.o and P1_CB2.o (defined in the same way as P1_CB.o) as input (since
P1_CB.o and P1_CB2.o are “views” of CB.o and CB2.o, the latter can also be considered
input cubes).

Let us look at the following two cells that are based on the same dimension values
as the cells shown in the above P1SIM.o examples.

□ Cell P1SIM.o(durLevel 'dur_bin', timeLevel 'time_bin', dowLevel 'dd') says that
there is 61% similarity between a corresponding pair of probability-based sub-cubes
of P1_CB.o and P1_CB2.o. These sub-cubes are based on low-level values of
dimension duration, time, dow and all values of dimension callee. The value of the
above cell is the vector-based summarization of cell-wise comparison of the above
pair of sub-cubes.

□ Cell P1SIM.o(durLevel 'dur_all', timeLevel 'time_all', dowLevel 'week') says that
there is 100% similarity of a pair of sub-cubes of P1_CB.o and P1_CB2.o that are
based on high-level values of dimension duration, time and dow, and all values of
dimension callee.

For details of multidimensional calling pattern similarity comparison, see [6]. 



5 Conclusions 

The problem of customer behavior profiling occurs in many applications such as tele- 
communications and electronic commerce. In this paper we have developed a data- 
warehouse and OLAP based framework for customer behavior profiling, and illus- 
trated its use in a telecommunication application. A prototype has been implemented 
at HP Labs. Our work demonstrates the practical value of using an OLAP server as a
scalable computation engine for creating and updating profiles, deriving calling pat- 
terns from profiles, as well as analyzing and comparing calling patterns. We plan to 
introduce parallel data warehousing and OLAP architecture to further scale the profiling
engine.



Abstract. A number of techniques have been proposed in the literature
to optimize the querying of datacubes (i.e., matrix representations
of multidimensional relations) in OLAP applications. In this paper we are 
concerned with the problem of providing very fast executions of range 
queries on datacubes by possibly returning ’approximate’ answers. To 
this end, given a large datacube with non-negative values for the mea- 
sure attribute, we propose to divide the datacube into blocks of possibly 
different sizes and to store a number of aggregate data for each of them 
(number of tuples occurring in the block, the sum of all measure values, 
minimum and maximum values). Then, when a range query (in particu- 
lar, count and sum) is issued, we compute the answer on the aggregate 
data rather than on the actual tuples, thus returning ’approximated’ re- 
sults. We introduce a number of techniques to perform an estimation 
(with expected value and variance) of range query answers and compare 
the accuracies of their estimations. We finally present a comparative 
analysis with other recently proposed techniques; the results confirm the 
effectiveness of our approach. 



1 Introduction 

On-Line Analytical Processing (OLAP) is a recent querying paradigm which 
deals with aggregate databases [5, 16, 4, 2]. An important data model for OLAP 
is the multidimensional relation that consists of a number of functional attributes 
(also called dimensions) and one or more measure attributes [1, 13]. The dimen- 
sion attributes are a key for the relation so that there are no two tuples with the 
same dimension value. Therefore a multidimensional relation can be seen as a multidimensional
matrix of the measure values, called datacube [7]. Some elements
of the matrix are null as their dimension values do not occur in the relation. 

A range query on a datacube is an aggregation query over a given dimension 
range (a cube). Typical queries are count, sum, max, min, and average. Since 
the size of datacubes may be very large, most work in the literature is devoted
to improving the performance of such range queries [1, 9, 11, 18].

Mukesh Mohania and A Min Tjoa (Eds.): DaWaK’99, LNCS 1676, pp. 65-77, 1999 
© Springer-Verlag Berlin Heidelberg 1999

In this paper we aim at providing very high performance in answering range
queries (in particular, count and sum queries) by possibly paying the price of 
introducing some approximation in the results. To this end we propose to run a 
range query over a compressed representation of the datacube, that is, a parti- 
tion of the datacube into blocks of possibly different sizes storing a number of 
aggregate data for each block. This approach is very useful when the user wants 
to have fast answers without being forced to wait a long time to get a precision 
which often is not necessary. In any case, approximated results will come with 
a detailed analysis of the possible error so that, if the user is not satisfied with 
the obtained precision, s/he may eventually decide to submit the query on the 
actual datacube. In this case, it is not necessary to run the query over all tuples 
but only on those portions of the range that do not fit the blocks. 

Our approach is probabilistic in the sense that, given a range query (say, a
count or a sum query) over a datacube M, we issue a query over a random datacube
variable (one for the count query and one for the sum query) ranging
over the population of all datacubes having the same compressed representation
as M. The query answers are then random variables, so they will be described
by an expected value and a variance. To achieve better estimations, we need
to restrict the range for the random datacube variable by a suitable usage of
the following aggregate data: the number of tuples (i.e., non-null elements), the
total sum of the measure values, the minimum and the maximum values. We
shall present 4 different techniques (called cases) which make a different usage
of aggregate data:

1. the count variable (resp., the sum variable) ranges over the set of all datacubes
having the same count (resp., sum) aggregate data as M; thus the two queries are
solved separately using the minimum amount of the available aggregate information;

2. both variables range over the same set of datacubes, namely the ones which have
the same count and sum aggregate data as M; therefore, as they use the same
information, the two queries are solved together;

3. both variables range over the set of all datacubes which have all the same
aggregate data as M (i.e., count, sum, min and max); as for Case 2,
the two queries are solved together;

4. the two queries are solved separately as in Case 1 but, while the range of
the count variable does not change at all, the range of the sum variable is restricted
to those datacubes which also have the same max aggregate data as M; note that the
information about min cannot be used as null values are not treated explicitly
and, then, they must be assimilated to zero.

Our analysis shows that the 4 cases give the same estimation for the answers
(corresponding to a simple interpolation on the basis of the size of the query 
range w.r.t. the size of a block) for both queries and even the same error for the 
count query. On the other hand, they return different errors for the sum query. 
A theoretical analysis to determine which case returns the smallest error is not 
an easy task so we performed some experiments. As expected, Case 3 worked
better than Case 2 but, surprisingly, Case 1 introduced smaller errors than Case
2 and Case 4 behaved better than Case 3. Thus Case 4 seems to be the best
estimator. We can therefore draw two conclusions: (i) for the count query, any
additional information is useless for it does not change the estimation and (ii) 
for the sum query, the aggregate count information is even ’dangerous’ for it 
enlarges the estimation errors. 

The paper is organized as follows. In Section 2 we discuss related work and 
in Section 3 we introduce the compressed representation of a datacube. In Sec- 
tion 4 we fix the probabilistic framework for estimating count and sum range 
queries on a datacube M by means of aggregate data and then we provide query 
estimations for the above four cases. The experiments for determining the case 
with the smallest error are reported in Section 5. In Section 5.2 we compare 
our estimations for the sum query with the ones obtained by the recent, promising
technique presented in [3]. Our experiments show that both approaches work
equally well in general but our results are definitely better when the values in the
datacubes have some distortions, like the presence of null elements and different
distributions of data for rows or columns.

2 Related Work 

The idea of using aggregate data to improve performance in answering range
queries was first proposed in [11]. To get very fast answers (e.g., running in
constant time), a large amount of auxiliary information (e.g., as large as the size
of the datacube) is required. When a reduced amount of additional information is
stored, performance degrades considerably as it is now necessary to access part of
the datacube to reconstruct the entire answer: the smaller the size, the higher
the number of original tuples which must be consulted. The tradeoff with our
approach can be stated as follows. Both approaches aim to get fast answers: [11]
invests in large space resources while we reduce our level of ambition about the
quality (in terms of precision) of the answers.

The possibility of returning approximate answers has been exploited also in 
[10] but, in that case, the approximation is temporary since results are output
on the fly while the tuples are being scanned and, at the end, after all tuples have
been consulted, the user will eventually get the correct answer. In our case, the
correct answer will never be received but, on the other hand, the actual datacube
will not be accessed either so that we may eventually get higher performance.

The problem of estimating detail data from summarized ones has been stud- 
ied in [6] and interesting results have been obtained by requiring the optimization 
of some criterion like the smoothness of the distribution of values. Our approach 
differs because of the particular criterion adopted (enforcing constraints on various
aggregate data), of the fact that no regular data distribution is assumed and
of our emphasis on the error estimation.

Histograms have long been used to summarize the contents of a
database relation, mainly to support query optimization [15]. A histogram stores
the number of occurrences for groups of values of a given domain: from the
aggregate data of a group, one has to estimate the number of occurrences of a
given single value or subset of the group. This topic has recently attracted
renewed interest, mainly to discover the best way of dividing the range of
values into groups [12, 17, 14]. Such techniques could be applied in the design of
the blocks in our approach, although histograms are typically monodimensional
and straightforward extensions to the multidimensional case do not seem to be
feasible.

A recent interesting proposal for providing a succinct description of a datacube
and estimating the actual data with a certain level of accuracy is reported in [3].
Data compression is based on storing suitable coefficients of interpolation lines
rather than aggregate data as in our approach. In Section 5.2 we shall compare
the proposal of [3] with ours.

3 Compressed Representations of Datacubes 

Let $\mathbf{i} = \langle i_1, \dots, i_r \rangle$ and $\mathbf{j} = \langle j_1, \dots, j_r \rangle$ be two $r$-tuples of cardinals, with
$r > 0$. We extend common operators for cardinals to tuples in the natural way:
$\mathbf{i} \le \mathbf{j}$ means that $i_1 \le j_1, \dots, i_r \le j_r$; $\mathbf{i} + \mathbf{j}$ denotes the tuple
$\langle i_1 + j_1, \dots, i_r + j_r \rangle$, and so on. Given a cardinal $p$ and $r > 0$, $p^r$ (or simply $p$, if $r$ is understood)
denotes the $r$-tuple of all $p$. Finally, $[\mathbf{i}..\mathbf{j}]$ denotes the range of all tuples $\mathbf{q}$ for
which $\mathbf{i} \le \mathbf{q} \le \mathbf{j}$.

A multidimensional relation $R$ is a relation whose scheme consists of $r > 0$
functional attributes (i.e., $r$ dimensions) and $s > 0$ measure attributes. The
functional attributes are a key for the relation so that there are no two tuples
with the same dimension value. For the sake of presentation but without loss
of generality, we assume that (i) $s = 1$ and the domain of the unique measure
attribute is the range $[0..w]$, where $w > 0$, and (ii) $r > 1$ and the domain
of each functional attribute $q$, $1 \le q \le r$, is the range $[1..n_q]$, where $n_q > 2$,
i.e., the projection of $R$ on the functional attributes is in the range $[\mathbf{1}..\mathbf{n}]$, where
$\mathbf{n} = \langle n_1, \dots, n_r \rangle$.

We consider the following range queries on $R$: given any range $[i..j]$ with
$1 \leq i \leq j \leq n$, (i) count query: $count_{[i..j]}(R)$ denotes the number of tuples of $R$
whose dimension values are in $[i..j]$, and (ii) sum query: $sum_{[i..j]}(R)$ denotes the
sum of all measure values for those tuples of $R$ whose dimension values are in
$[i..j]$.

Since the dimension attributes are a key, the relation $R$ can be naturally
viewed as a $[1..n]$ matrix (i.e., a datacube) $M$ of elements with values in $[0..w] \cup \{e\}$
such that for each $i$, $1 \leq i \leq n$, $M[i] = v \in [0..w]$ if the tuple $\langle i, v \rangle$ is in $R$,
or otherwise $M[i] = e$. In the latter case no tuple with dimension value
$i$ is present in $R$, so $e$ stands for the null element. The above range queries can then be
reformulated in terms of array operations as follows: (i) $count(M[i..j]) = |\{q : i \leq q \leq j \text{ and } M[q] \neq e\}|$;
(ii) $sum(M[i..j]) = \sum_{i \leq q \leq j} M[q]$, where $e$ yields the
value 0 in the summation.

We now introduce a compressed representation of the relation $R$ by dividing
the datacube $M$ into a number of blocks and by storing a number of aggregate
data for each of them. To this end, given $m = \langle m_1, \ldots, m_r \rangle$ for which $1 \leq m \leq n$,
an $m$-compression factor for $M$ is a set $F = \{f_1, \ldots, f_r\}$, such that for each $q$,
$1 \leq q \leq r$, $f_q$ is a $[0..m_q]$ array for which $0 = f_q[0] < f_q[1] < \cdots < f_q[m_q] = n_q$.
$F$ determines the following $m_1 \times \cdots \times m_r$ blocks: for each tuple $k$ in $[1..m]$, the
block with index $k$ is the submatrix of $M$ ranging from $F(k - 1) + 1$ to $F(k)$,
where $F(k)$ denotes the tuple $\langle f_1[k_1], \ldots, f_r[k_r] \rangle$. The size (i.e., the number of
elements) of a block $k$ is $(f_1[k_1] - f_1[k_1 - 1]) \times \cdots \times (f_r[k_r] - f_r[k_r - 1])$.

For instance, consider the [<1,1>..<10,6>] matrix M in Figure 1.(a),
which is divided into 6 blocks as indicated by the double lines. We have that
m = <3,2>, $f_1[0] = 0$, $f_1[1] = 3$, $f_1[2] = 7$, $f_1[3] = 10$, and $f_2[0] = 0$, $f_2[1] = 4$,
$f_2[2] = 6$. The block <1,1> has size 3×4 and range [<1,1>..<3,4>]; the block
<1,2> has size 3×2 and range [<1,5>..<3,6>], and so on.

[Figure 1: (a) the 10×6 datacube M, partitioned into 6 blocks; (b) its compressed
representation, a 3×2 matrix of 4-tuples (count, sum, max, min). Legible entries of (b):
row 1: (8,26,5,1), (5,29,9,5); row 2: (7,18,5,1), (7,31,...); row 3: (4,3,1,0), (6,23,...).]

Fig. 1. A bidimensional datacube and its compressed representation



A compressed representation of the datacube $M$ consists of selecting an $m$-compression
factor $F$ and storing the following aggregate data on the $F$-blocks
of $M$: (i) the $[1..m]$ datacubes $M_{count,F}$ and $M_{sum,F}$ such that for each $k \in [1..m]$,
$M_{cs,F}[k] = cs(M[F(k - 1) + 1..F(k)])$, where $cs$ stands for count or sum;
(ii) the $[1..m]$ datacubes $M_{max,F}$ and $M_{min,F}$ such that for each $k \in [1..m]$,
$M_{mm,F}[k] = 0$ if all elements in the block $k$ are null, or otherwise
$M_{mm,F}[k] = mm_{F(k-1)+1 \leq q \leq F(k) \wedge M[q] \neq e}(M[q])$, where $mm$ stands for max or min.

The compressed representation of the datacube M in Figure 1.(a) is represented
in Figure 1.(b) by a matrix of 4-tuples, one for each block. Every 4-tuple
indicates respectively the number of non-null elements, the sum of the elements,
the max and the min in the block. For instance, the block <1,1> has 8 non-null
elements with sum 26, max 5 and min 1; the block <1,2> has 5 non-null
elements with sum 29, max 9, min 5, and so on.
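
The construction of the four aggregates per block can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the datacube is held as a sparse dictionary (a missing key plays the role of the null element), the function name `compress` is ours, and the cell values are hypothetical, since the data of Figure 1.(a) is not reproduced here. Only the compression factor is taken from the running example.

```python
from itertools import product

def compress(cells, F):
    """Return {block index k: (count, sum, max, min)} for a sparse datacube.

    cells: dict mapping 1-based index tuples to measure values; a missing
           key plays the role of the null element.
    F:     one boundary list per dimension, with f[0] = 0 and f[-1] = n_q.
    """
    m = [len(f) - 1 for f in F]                      # number of blocks per dimension
    result = {}
    for k in product(*(range(1, mq + 1) for mq in m)):
        # Block k spans F(k - 1) + 1 .. F(k) in every dimension.
        spans = [range(f[kq - 1] + 1, f[kq] + 1) for f, kq in zip(F, k)]
        values = [cells[q] for q in product(*spans) if q in cells]
        if values:
            result[k] = (len(values), sum(values), max(values), min(values))
        else:
            result[k] = (0, 0, 0, 0)                 # all-null block: max and min are 0
    return result

# Compression factor of the running example, m = <3,2>.
F = [[0, 3, 7, 10], [0, 4, 6]]
# Hypothetical cell values, chosen only to exercise the function.
cells = {(1, 1): 5, (2, 3): 1, (3, 4): 2, (1, 5): 9, (4, 1): 0}
blocks = compress(cells, F)
print(blocks[(1, 1)])   # aggregates of the block with range [<1,1>..<3,4>]
```

Note that a stored value 0 is a non-null element and is counted, whereas a missing cell is not; this is the distinction the count aggregate is meant to preserve.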

4 Estimation of Range Queries over Compressed 
Datacubes 

4.1 The Probabilistic Framework 

We next introduce a probabilistic framework for estimating the answers of range 
queries (sum and count) by consulting aggregate data rather than the actual 








F. Buccafurri, D. Rosaci, and D. Sacca 



datacube. For such estimations we make the queries random by replacing the
datacube $M$ with a random datacube variable ranging over the population of
all datacubes which have the same aggregate data as $M$. We have the following
datacube populations: for each $agg$ = count, sum, max, min, $M^{-1}_{agg,F}$ is the population
of all $[1..n]$ matrices $M'$ of elements in $[0..w] \cup \{e\}$ for which $M'_{agg,F} = M_{agg,F}$.
Let the queries $count(M[i..j])$ and $sum(M[i..j])$ be given. We shall estimate
the two queries by the expected value $E$ and the variance $\sigma^2$ of $count(\tilde{M}'[i..j])$ and
$sum(\tilde{M}''[i..j])$, respectively, where $\tilde{M}'$ and $\tilde{M}''$ are random datacube variables.
We shall consider various ranges for the two random variables; in particular, we
shall analyze the following 4 different cases:

1. $\tilde{M}'$ ranges over $M^{-1}_{count,F}$ and $\tilde{M}''$ over $M^{-1}_{sum,F}$;

2. both $\tilde{M}'$ and $\tilde{M}''$ range over $M^{-1}_{count,F} \cap M^{-1}_{sum,F}$;

3. both $\tilde{M}'$ and $\tilde{M}''$ range over $M^{-1}_{count,F} \cap M^{-1}_{sum,F} \cap M^{-1}_{max,F} \cap M^{-1}_{min,F}$;

4. $\tilde{M}'$ ranges over $M^{-1}_{count,F}$ and $\tilde{M}''$ ranges over $M^{-1}_{sum,F} \cap M^{-1}_{max,F}$.

Given $query(M[i..j])$, where $query$ stands for count or sum, due to the linearity
of the expected value (operator $E$) we have:

$$E(query(\tilde{M}[i..j])) = \sum_{k \in PB_F(i,j)} E(query(\tilde{M}[i_k..j_k])) + \sum_{q \in TB_F(i,j)} M_{query,F}[q]$$

where $TB_F(i,j)$ returns the set of block indices $k$ that are totally contained in
the range $[i..j]$ (i.e., both $i \leq F(k - 1) + 1$ and $F(k) \leq j$), $PB_F(i,j)$ returns
the set of block indices $k$ that are partially inside the range, and $i_k$ and $j_k$
are the boundaries of the portion of a block $k \in PB_F(i,j)$ which overlaps the
range $[i..j]$. For example, for the datacube in Figure 1.(a), given i = <4,3>
and j = <8,6>, the block <2,2> is totally contained in the range, the blocks
<2,1>, <3,1>, <3,2> are partially contained in the range (with boundaries
[<4,3>..<7,4>], [<8,3>..<8,4>] and [<8,5>..<8,6>], respectively), and the
blocks <1,1>, <1,2> are outside the range.
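
The split of a query range into totally and partially contained blocks can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `block_range` and `classify_blocks` are ours, and the only data used is the compression factor and the query range of the example above.

```python
from itertools import product

def block_range(F, k):
    """Lower and upper corner of block k, i.e. F(k - 1) + 1 and F(k)."""
    lo = tuple(f[kq - 1] + 1 for f, kq in zip(F, k))
    hi = tuple(f[kq] for f, kq in zip(F, k))
    return lo, hi

def classify_blocks(F, i, j):
    """Return (TB, PB): the totally contained block indices, and a dict
    mapping each partially contained block index to its overlap with [i..j]."""
    total, partial = [], {}
    m = [len(f) - 1 for f in F]
    for k in product(*(range(1, mq + 1) for mq in m)):
        lo, hi = block_range(F, k)
        ov_lo = tuple(max(a, b) for a, b in zip(lo, i))
        ov_hi = tuple(min(a, b) for a, b in zip(hi, j))
        if any(a > b for a, b in zip(ov_lo, ov_hi)):
            continue                                  # block lies outside the range
        if (ov_lo, ov_hi) == (lo, hi):
            total.append(k)                           # whole block inside the range
        else:
            partial[k] = (ov_lo, ov_hi)               # boundaries of the overlap
    return total, partial

F = [[0, 3, 7, 10], [0, 4, 6]]                        # compression factor of Figure 1
total, partial = classify_blocks(F, (4, 3), (8, 6))
print(total)
print(partial)
```

For totally contained blocks the stored aggregate is used as is; only the blocks returned in the second component require an estimate.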

Concerning the variance, we assume statistical independence between the
values of different blocks, so that its value is determined by just summing the
variances of all partially overlapped blocks without having to introduce any
covariance, that is, $\sigma^2(query(\tilde{M}[i..j])) = \sum_{k \in PB_F(i,j)} \sigma^2(query(\tilde{M}[i_k..j_k]))$. It turns
out that we only need to study the estimation of a query ranging on one partial
block, as all other cases can be easily recomposed from this basic case.

From now on, we assume that the query range $[i..j]$ is strictly inside one single
block, say the block $k$, i.e., $F(k - 1) + 1 \leq i \leq j \leq F(k)$. Let $b$ be the size of the
block $k$, $d$ ($0 < d < b$) be the number of elements in the range $[i..j]$, $t = M_{count,F}[k]$
be the number of non-null elements in the block $k$, $c = M_{sum,F}[k]$ be the sum
of the elements in the block, $l = M_{min,F}[k]$ and $u = M_{max,F}[k]$ be the min and
the max measure values in the block.

4.2 Estimating the count and sum query separately — Case 1 

In the first case we estimate separately the two range queries $count(M[i..j])$ and
$sum(M[i..j])$ using their corresponding aggregate data.

Theorem 1.¹ Given the random datacube variable $\tilde{M}'$ over $M^{-1}_{count,F}$ and the
random datacube variable $\tilde{M}''$ over $M^{-1}_{sum,F}$, (1) the probability that the number
of non-null elements of $\tilde{M}'$ in the range $[i..j]$ be $h$ (for each $h$, $0 \leq h \leq t$) and
(2) the probability that the sum of the elements of $\tilde{M}''$ in the range $[i..j]$ be $s$ (for
each $s$, $0 \leq s \leq c$) are equal to, respectively:

$$P_1(h) = \frac{\binom{d}{h}\binom{b-d}{t-h}}{\binom{b}{t}} \quad \text{and} \quad P_1(s) = \frac{\binom{d+s-1}{s}\binom{b-d+c-s-1}{c-s}}{\binom{b+c-1}{c}}$$

We next compute the expected values and variances of the two random queries:

Proposition 2. Let $\tilde{M}'$ range over $M^{-1}_{count,F}$ and $\tilde{M}''$ range over $M^{-1}_{sum,F}$. Then:

$E(count(\tilde{M}'[i..j])) = (d/b) \cdot t$, $\sigma^2(count(\tilde{M}'[i..j])) = t \cdot (b - t) \cdot d \cdot \frac{b - d}{b^2 \cdot (b - 1)}$, and
$E(sum(\tilde{M}''[i..j])) = (d/b) \cdot c$, $\sigma^2(sum(\tilde{M}''[i..j])) = \sum_{s=0}^{c} ((d/b) \cdot c - s)^2 \cdot P_1(s)$.
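
The two Case 1 distributions can be checked numerically with a small Python sketch. It assumes that the count distribution is the hypergeometric one, $\binom{d}{h}\binom{b-d}{t-h}/\binom{b}{t}$, and that the sum distribution is the ratio of binomial coefficients $\binom{d+s-1}{s}\binom{b-d+c-s-1}{c-s}/\binom{b+c-1}{c}$; the function names and the block parameters are hypothetical and chosen only for illustration. Exact rational arithmetic is used so that the expected values can be compared with $(d/b) \cdot t$ and $(d/b) \cdot c$ without rounding.

```python
from fractions import Fraction
from math import comb

def p1_count(h, b, d, t):
    """Probability of h non-null elements among the d cells of the range."""
    if h < 0 or h > t:
        return Fraction(0)
    return Fraction(comb(d, h) * comb(b - d, t - h), comb(b, t))

def p1_sum(s, b, d, c):
    """Probability that the d cells of the range sum to s (needs 0 < d < b)."""
    if s < 0 or s > c:
        return Fraction(0)
    return Fraction(comb(d + s - 1, s) * comb(b - d + c - s - 1, c - s),
                    comb(b + c - 1, c))

def mean_and_variance(pmf, upper):
    """Exact mean and variance of a distribution supported on 0..upper."""
    mean = sum(x * pmf(x) for x in range(upper + 1))
    var = sum((x - mean) ** 2 * pmf(x) for x in range(upper + 1))
    return mean, var

def count_variance(b, t, d):
    """Closed form t(b - t)d(b - d) / (b^2 (b - 1)) for the count variance."""
    return Fraction(t * (b - t) * d * (b - d), b * b * (b - 1))

# Hypothetical block: 12 cells, 8 of them non-null, total sum 20, range of 5 cells.
b, d, t, c = 12, 5, 8, 20
count_mean, count_var = mean_and_variance(lambda h: p1_count(h, b, d, t), t)
sum_mean, sum_var = mean_and_variance(lambda s: p1_sum(s, b, d, c), c)
print(count_mean, count_var)
print(sum_mean, float(sum_var))
```

The count variance has a closed form, whereas the sum variance is obtained by summing over all $c + 1$ possible values of $s$.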



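4.3 Estimating count and sum queries together — Case 2

Case 2 estimates both range queries, $count(M[i..j])$ and $sum(M[i..j])$, at one
time, using the information about count and sum aggregate data.

Theorem 3. Given the random datacube variable $\tilde{M}$ over $M^{-1}_{count,F} \cap M^{-1}_{sum,F}$,
for each $s$, $0 \leq s \leq c$, and for each $h$, $0 \leq h \leq t$, the probability that both
the sum and the number of all non-null elements in the range $[i..j]$ be $s$ and $h$,
respectively, is $P_2(h,s)$ = [formula illegible in the source; it is defined through an
auxiliary function of $(x, y, z)$ that equals 0 if $x < y$ or [...], and otherwise involves a
binomial coefficient with upper term $y + z - 1$].

Proposition 4. Let $\tilde{M}$ range over $M^{-1}_{count,F} \cap M^{-1}_{sum,F}$. Then:

$E(sum(\tilde{M}[i..j])) = (d/b) \cdot c$, $\sigma^2(sum(\tilde{M}[i..j])) = \sum_{s=0}^{c} \left( ((d/b) \cdot c - s)^2 \cdot \sum_{h=0}^{t} P_2(h,s) \right)$,

$E(count(\tilde{M}[i..j])) = (d/b) \cdot t$, $\sigma^2(count(\tilde{M}[i..j])) = t \cdot (b - t) \cdot d \cdot \frac{b - d}{b^2 \cdot (b - 1)}$.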

The above proposition says that the count query does not need the additional 
information on the sum of the elements in a block to perform a good estimation. 

A different, surprising observation holds for the sum query: as discussed later in 
the paper, the knowledge of the number of non-null elements seems to reduce 
the accuracy of the estimation. 



¹ For space reasons the proofs of theorems and propositions are omitted.

4.4 Estimating count and sum queries together using min and max
— Case 3 

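We now exploit the information about the minimum and the maximum value
for each block, stored in the two $[1..m]$ matrices $M_{min,F}$ and $M_{max,F}$. As in
Case 2, we estimate both range queries, $count(M[i..j])$ and $sum(M[i..j])$, at one
time using the same datacube range.

Theorem 5. Given the random datacube variable $\tilde{M}$ ranging over
$M^{-1}_{count,F} \cap M^{-1}_{sum,F} \cap M^{-1}_{max,F} \cap M^{-1}_{min,F}$, for each $s$, $0 \leq s \leq c$, and for each $h$, $0 \leq h \leq t$,
the probability that both the sum and the number of all non-null elements in the
range $[i..j]$ be $s$ and $h$, respectively, is $P_3(h,s)$ = [formula illegible in the source], where
$N(x,y,z)$ is equal to 0 if $y \cdot u < z$ or $x < y$, or otherwise to [summation illegible in the
source], and $u' = u - l$.

Proposition 6. Let $\tilde{M}$ range over $M^{-1}_{count,F} \cap M^{-1}_{sum,F} \cap M^{-1}_{max,F} \cap M^{-1}_{min,F}$. Then:

$E(sum(\tilde{M}[i..j])) = (d/b) \cdot c$, $\sigma^2(sum(\tilde{M}[i..j])) = \sum_{s=0}^{c} \left( ((d/b) \cdot c - s)^2 \cdot \sum_{h=0}^{t} P_3(h,s) \right)$,

$E(count(\tilde{M}[i..j])) = (d/b) \cdot t$, $\sigma^2(count(\tilde{M}[i..j])) = t \cdot (b - t) \cdot d \cdot \frac{b - d}{b^2 \cdot (b - 1)}$.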




4.5 Estimating sum queries without using count — Case 4 

In this case we do not take into account the knowledge about the null elements,
which will be considered as elements with value 0. We shall use the sum and the
max: the min will be used only when there are no null elements in the block;
otherwise the min value becomes 0.

Theorem 7. Given the random datacube variable $\tilde{M}$ ranging over
$M^{-1}_{sum,F} \cap M^{-1}_{min,F} \cap M^{-1}_{max,F}$ if $t = 0$ or over $M^{-1}_{max,F} \cap M^{-1}_{sum,F}$ otherwise, for each $s$,
$0 \leq s \leq c$, the probability that the sum of all elements
in the range $[i..j]$ be $s$ is $P_4(s)$ = [formula illegible in the source], where $L(x,z)$ is equal to
0 if [condition partly illegible in the source], or otherwise to [summation illegible in the
source], with $l' = l$ if $t = 0$ or 0 otherwise, and $u' = u - l'$.

Proposition 8. Let $\tilde{M}$ range over $M^{-1}_{sum,F} \cap M^{-1}_{min,F} \cap M^{-1}_{max,F}$ if $t = 0$ or over
$M^{-1}_{max,F} \cap M^{-1}_{sum,F}$ otherwise. Then:

$E(sum(\tilde{M}[i..j])) = (d/b) \cdot c$, $\sigma^2(sum(\tilde{M}[i..j])) = \sum_{s=0}^{c} ((d/b) \cdot c - s)^2 \cdot P_4(s)$.

5 Comparative Analysis

5.1 Comparison of the 4 Cases 

Let us first consider the count query. The 4 cases give the same
results, meaning that for this query the only useful information is the number of
non-null elements for each block. Although we cannot compare the 4 cases, we
can make some general remarks on the estimation, particularly because we also have
a closed formula for the error. Our first observation regards the density of
the block, defined as $\delta = t/b$. Considering constants $b$ and $d$, the error reaches its
maximum value for $\delta = 1/2$. It monotonically and symmetrically decreases down
to 0 by moving the value of $\delta$ from 1/2 to 1 and from 1/2 to 0, respectively.* This
can be intuitively explained by the fact that the estimation of the count query
corresponds to guessing the number of 1s in $t$ extractions of a binary variable
in a sample set composed of $b$ bits, with probability of finding 1 equal to $d/b$.
Clearly, $\delta = 1/2$ represents the highest degree of uncertainty; on the contrary, for $\delta$
close to 1 (or to 0), the number of different configurations producing this density
is small. Similar results can be obtained by considering the size $d$ of the range
as the only variable. In this case, the maximum error occurs for $d = b/2$, while the
error is 0 for $d = 0$ and $d = b$. This result is also quite intuitive. Note that, in case
the block is sufficiently dense (or sufficiently sparse), the error takes very small
values. To give an idea, consider the case of $b = 10000$, $t = 5000$ and $d = b/2$: the
error is 25. For a denser block, for instance with $t = 7500$, a count query
with size $d = 7500$ leads to an error of 11.25. We also got very small errors in the
experimental results reported later in this section.

Another intuitive observation regards the size $b$ of the block. It is easy to see
that the error is monotonically increasing with $b$, for a fixed density $\delta$ and a fixed
$d > 1$. As expected, larger blocks in general imply higher loss of information.
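
These remarks on the count error can be reproduced with a few lines of Python. The sketch assumes that the error is the standard deviation of a hypergeometric count, i.e. $\sigma^2 = t(b - t)d(b - d)/(b^2(b - 1))$; the function name `count_error` and the sample parameters (other than $b = 10000$, $t = 5000$, $d = b/2$) are our own.

```python
from math import sqrt

def count_error(b, t, d):
    """Standard deviation of the estimated count for a range of d cells in a
    block of b cells containing t non-null elements (assumed closed formula)."""
    return sqrt(t * (b - t) * d * (b - d) / (b * b * (b - 1)))

# The example of the text: b = 10000, t = 5000, d = b/2.
print(round(count_error(10000, 5000, 5000), 2))

# For fixed b and d the error peaks at density 1/2 and is symmetric around it.
peak_t = max(range(0, 101), key=lambda t: count_error(100, t, 20))
print(peak_t)

# For fixed density 1/2 and fixed d = 20 the error grows with the block size b.
errors_by_size = [count_error(b, b // 2, 20) for b in (40, 100, 400, 1000)]
print(errors_by_size)
```

Since the formula depends on $t$ only through the product $t(b - t)$, the symmetry between densities $\delta$ and $1 - \delta$ is immediate.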

Let us compare the 4 cases w.r.t. the sum query. They give the same
estimation of the sum but different values for the error. The best method is then
the one that computes the smallest error. The formulas are not easy to read
and, therefore, we cannot characterize the behavior of the 4 cases on the basis of
theoretical observations. We need to carry out a number of experiments. So we
have taken a datacube block of 100 elements and we have randomly generated
the values (including the null elements) for 10 block instances. Then for each
instance we have computed a number of sum queries using different range sizes
and we have compared the actual error (w.r.t. the estimated answer) and the
errors computed by the 4 cases. The results confirm that all 4 cases provide a
good estimation, as the actual answer $S$ is always inside the interval $S^* \pm 2 \cdot E_S$
(the estimated answer plus or minus twice the estimated error).
Surprisingly (at least for us), Case 1 computes smaller errors than Case 2 and
Case 4 does the same w.r.t. Case 3. For example, given one of the datacubes used
in the experiments and a range query with bounds [1..20] with actual value for the
sum of 266 and actual value for count of 19, the estimated sum was 248.6 and
the estimated count was 17: so, their respective actual errors were 17.4 and 2.

* The error evaluated for a density $\delta$ coincides with that evaluated for the density $1 - \delta$.

Table 1. Experimental results: comparison of the 4 cases



The estimated errors were: 52 for Case 1, 61 for Case 2, 40 for Case 3 and 27 for
Case 4. The estimated error for the count query coincided with the actual one
(2). We obtained a similar behavior for the estimated errors also for the other
range queries (some of these results are reported in Table 1). On the basis of the
results of our experiments, we argue that the usage of count aggregate data is
'dangerous' as it increases the estimated error. On the other hand, the knowledge
about max and min is useful, since both Cases 3 and 4 compute smaller
errors than Cases 1 and 2. Case 4 seems to be the best method, but additional
experiments are probably necessary before definitely assigning this title to it. If
one decides to use Case 4, then it is possible to save some space as follows.
Since the min value is used in Case 4 only when there are no null elements, it
is possible to use the same word for both min and count aggregate data: a 1-bit
flag will then state which information is being stored in the word.
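
The shared word for min and count could be laid out as in the following sketch. The layout is purely an assumption made for illustration (a 32-bit word whose top bit is the flag and whose remaining 31 bits hold the value), as are the function names; the text only states that one word and a 1-bit flag suffice.

```python
FLAG_MIN = 1 << 31            # assumed layout: top bit of a 32-bit word is the flag
VALUE_MASK = FLAG_MIN - 1     # the remaining 31 bits hold either min or count

def pack_min_or_count(b, t, block_min):
    """Store min when the block has no nulls (count is then b), else store count."""
    if not (0 <= t <= b <= VALUE_MASK and 0 <= block_min <= VALUE_MASK):
        raise ValueError("value does not fit in 31 bits")
    if t == b:
        return FLAG_MIN | block_min   # flag set: the word carries the min value
    return t                          # flag clear: the word carries the count

def unpack_min_or_count(word, b):
    """Return (count, min) as needed by Case 4 for a block of size b."""
    if word & FLAG_MIN:
        return b, word & VALUE_MASK   # no nulls: count equals the block size
    return word, 0                    # some nulls: the min is taken to be 0

dense = pack_min_or_count(b=12, t=12, block_min=3)
sparse = pack_min_or_count(b=12, t=8, block_min=1)
print(unpack_min_or_count(dense, 12), unpack_min_or_count(sparse, 12))
```

The block size $b$ does not need to be stored, since it follows from the compression factor.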



5.2 Comparison with Barbara&Sullivan’s method 

Barbara&Sullivan have proposed in [3] a particular method (called quasi-cubes)
to compress a bidimensional ($n \times m$) datacube by storing the sum of values for
each row $i$ (suitably converted into probabilities $r_i$) and by constructing a regression
line for each column $j$ of the form $p_{ij} = A_j \cdot r_i + B_j$, so that the element
$(i,j)$ is estimated by the value $p_{ij} + s_{ij} \cdot S_j$, where $S_j$ is the average error of
the regression and $s_{ij}$ takes the value $\pm 1$ according to the position of the actual element
$(i,j)$ w.r.t. the regression line. In sum, the quasi-cubes require the following
data: $n$ values $r_i$ for the rows, $m$ tuples $(A_j, B_j, S_j)$ for the columns, $n \times m$ bits
for the signs $s_{ij}$ and the total sum of all elements. It is also proposed to retain
some actual values to improve the estimation, but this will obviously increase the
size of the compressed data. To compare Barbara&Sullivan’s method with our
approach, we have considered several datacubes of size 100 × 100, we have run
several queries on each datacube and we have evaluated the correct answers and
computed the ones determined by the two methods using the same compression
factor of 15 (this means blocks of around 50 elements for our method). Since
Barbara&Sullivan’s method does not provide any estimation error, the comparison
has been made only w.r.t. the estimated values. Figure 2.(a) describes the
actual errors of the two methods for datacubes with values generated according
to a Gaussian distribution with average E = 200 and deviation σ = 50. We have
used dotted lines for the results of Barbara&Sullivan’s method and full lines for
our results. The reported errors refer to various queries that are ordered on the
X axis according to the ratio "size of the query range / 50". It turns out that both
methods yield very good results, but our method behaves better in general except
when the query range is very small. The fact that Barbara&Sullivan’s method
works better for small ranges is not surprising, as the original goal of that method
was to estimate single elements. We have obtained similar results by changing
the variance of the distribution to the following values: 25, 75 and 100. Indeed
our method works a bit better with larger variances. In particular, while the average
errors of the two methods were the same (0.14) with variance 25, the average error of
quasi-cubes was 0.84 compared to 0.58 for our method. In the experiments
reported in Figure 2.(b), we have modified the datacube generated with the above
Gaussian distribution (average 200 and variance 50) by inserting around 25% of
null elements in a random way. The estimations of Barbara&Sullivan’s method
become much worse than ours: the distortion caused by the nulls is partially absorbed
by our block structure. The average error of our estimation is 1.93, while that
of quasi-cubes is 16.03. In this case Barbara&Sullivan’s method suggests
retaining the positions of the nulls; but then the compression factor changes and
the comparison becomes more difficult. Subsequent experiments have confirmed
that our method works much better when some distortions are added to a Gaussian
value distribution. Indeed, in the experiments represented in Figures 2.(c),
2.(d) and 2.(e), we have used a different Gaussian distribution for each row i
according to the following laws:

$E(i) = 50 + 1.5 \cdot i$; $\sigma(i) = 10 + 0.5 \cdot i$ (Fig. 2.(c))

$E(i) = 50 + 3 \cdot i$; $\sigma(i) = 10 + i$ (Fig. 2.(d))

$E(i) = 50 + 6 \cdot i$; $\sigma(i) = 10 + 2 \cdot i$ (Fig. 2.(e))

Similar results, shown in Figure 2.(f), were obtained by making the Gaussian
distribution vary for each column i according to the same law as in Figure 2.(c).
In these cases the gap between the average errors of the two estimations was considerable.
For instance, for the experiments reported in Figure 2.(e), the average error of
Barbara&Sullivan’s method was 18.8 while that of our method was 1.38.

Abstract. Data Warehousing requires effective methods for processing and
storing large amounts of data. OLAP applications form an additional tier in the 
data warehouse architecture and in order to interact acceptably with the user, 
typically data pre-computation is required. In such a case compressed represe- 
ntations have the potential to improve storage and processing efficiency. This 
paper proposes a compressed database system which aims to provide an 
effective storage model. We show that in several other stages of the Data 
Warehouse architecture compression can also be employed. Novel systems 
engineering is adopted to ensure that compression/decompression overheads are 
limited, and that data reorganisations are of controlled complexity and can be 
carried out incrementally. The basic architecture is described and experimental 
results on the TPC-D and other datasets show the performance of our system. 



1 Introduction 

The Relational Model [1] proved a sound base for database systems and has become
the dominant standard approach to operational databases or On Line Transaction
Processing (OLTP). The requirements of On Line Analytical Processing (OLAP)
are distinctly different from those of OLTP [2], [3]. Data Warehouses are integrated
databases providing collected information from various heterogeneous information
sources [4], while OLAP applications require aggregation in several dimensions.
Multidimensional aggregation is a complex process and, to provide access in real
time, materialized views are employed. These are pre-computed and stored sets of
aggregated values and may take up large volumes of storage [27], [5]. Several researchers
have proposed techniques for selecting only subsets for materialization
[6], [7], [8], [28]. However, in the presence of hierarchies in dimensions, the storage
requirement is even higher than for a single level domain. Thus materialized views
provide at best a partial solution in the Data Warehouse environment [9]. There is
thus a need for a new generation of DBMS able to achieve orders of magnitude
improvements in performance.

Compression has been used to a limited extent on OLTP databases [13], [14], [15].
In the OLAP environment applications are less mature, but compression through bit
indexing has been applied in some of the systems [38], [16]. The advantages of data
compression apply widely in database architecture, permitting more rapid processing,
more effective I/O and better communications performance [17]. In main memory
databases for high performance applications a typical approach to data representation
is to characterize domain values as a sequence of strings. Tuples are represented as a 
set of fixed length pointers or tokens referring to the corresponding domain values 
[18]. This approach provides significant compression but leaves each domain value 
represented by a fixed length pointer/token, typically a machine word in length. An 
alternative strategy is to store a fixed length reference number for each domain value 
instead of a pointer to the value [18], [19]. An optimal strategy for generating lexical 
tokens is described by Wang and Lavington [20]. Despite the considerable research 
into memory resident databases [10], [11], [12], [21], [39] most current DBMS are built 
around disk resident data. The balance is between the greater compactness of storage, 
with the greater efficiency of some storage operations, versus the greater software 
complexity and processing overheads of de-tokenizing, particularly if the symbol 
table is backing-store resident. The Peterlee Relational Test Vehicle (PRTV), an early
relational database prototype [22], applied two-dimensional compression by sorting
similar records into adjacent locations and storing only a flag bit per field, together with the
differences. However, such methods result in variable length tuples, destroying the
simple addressability of fixed-length tuples. Compression/decompression may then
become the dominant element in processing. Common compression techniques like
Huffman [23] or LZW coding [24] are unsuitable for processing in databases,
because we require relational operations to work on the compressed data as well as
random access to the rows of the table. The method of representing the relations in a
compressed format must be such that the system is efficient in each of its modes of
processing (e.g. loading, update, retrieval).



2 Applying Compression in Data Warehousing 

The Data Warehouse concept can be found in [25], [4], [3]. There are five stages
in the DW configuration where compression can be applied:

• Information Source to Data Warehouse transfer: This is an in-advance
approach, and the encode/decode process is not time critical. Compressed data
use less bandwidth and can thus be sent faster across communications networks.
Compression thus facilitates loading and update of the Data Warehouse.

• The Data Warehouse query stage: Compression can result on average in an eightfold
reduction in data volume in practice [26], while [17] observed processing six times
faster than with uncompressed data. For disk-based data a compressed
representation reduces the I/O traffic, while more information can be retained in
cache or main memory. For main memory databases, the benefit is retaining the
faster RAM performance while reducing the cost of the necessary RAM storage.
In Hibase [26], where the compression factor is at least eight, the cost of RAM is a
relatively insignificant factor.

• Distributed database systems (Data Marts): These systems are used as departmental
subsets focused on selected subjects of users' interests [CD97]. Our
system can potentially be used for caching. Data Marts are usually much smaller
in size than the DW database.

• Back-up storage and recovery: When the addressability of the individual tuples
need not be preserved, a further reduction in data volume can be achieved by
applying a secondary lossless compression algorithm prior to archiving and
transmitting.

• Client-server environment: Potentially the compressed representation could be
utilized in a client-server architecture. The dictionary could be distributed to
every client and the compressed representation would be used for data transfer
and querying. The volume cost of the dictionary is not significant considering the
volume of the views when they are fully materialized. Table 1 shows the volume
of the dictionary compared to the raw data and the materialized views for the
TPC-D benchmark dataset [36], table Lineitem (scale factor 0.001).





                      Data Volume
Dictionary            1 MB
Raw data              3.97 MB
Materialized views    3.7 GB

Table 1. Compression overhead compared to the Materialized views volume (Bytes)

3 Compressed Relations 

Table 2 shows a single relation that describes the characteristics of a number of chips.
If this is stored as a file of records, we have to make each field of these records big
enough to hold the largest value that any of the records stores in that field. Since one is
not likely to know exactly how large the individual values may be, the tendency will
be to err on the side of caution and make the fields longer than is strictly necessary. In
Hibase [26], First Order Compression reduces each field to an integer containing just
sufficient bits to encode all the values that occur within the domain of that field in the
database. In Table 2, since there are seven distinct part names, the parts column of the
relation could be represented as a list of 3-bit numbers. Similarly, since there are only
3 distinct values for the number of pins, these could be represented as 2-bit numbers.
Once similar compression has been applied to the other fields, the resulting tuples are
found to be only 7 bits long (Table 2), so the whole relation occupies only 49 bits. Our
compression mechanism converts the values in each column of a relational table into
short integers, using a dictionary that is unique to the domain of each column. The
compressed code for a field value is given by its position (i.e. subscript) in the
corresponding dictionary, which is used for fields containing text. Numeric fields are
coded directly as variable length binary numbers and do not require the dictionary
mechanism.
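
The dictionary mechanism can be illustrated with a minimal Python sketch. It is not the Hibase implementation: the function names are ours, the column values are hypothetical stand-ins for the contents of Table 2, and for simplicity both columns are dictionary-coded here, although numeric fields are coded directly in the system described above.

```python
def bits_needed(cardinality):
    """Smallest token width (in bits) able to encode `cardinality` distinct values."""
    return max(1, (cardinality - 1).bit_length())

def build_dictionary(column):
    """Return (dictionary, tokens): each token is the subscript of the
    field value in the per-column dictionary."""
    dictionary, index, tokens = [], {}, []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)   # next free subscript
            dictionary.append(value)
        tokens.append(index[value])
    return dictionary, tokens

# Hypothetical column contents: seven distinct part names, three pin counts.
parts = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
pins = [8, 16, 24, 8, 16, 8, 24]

part_dict, part_tokens = build_dictionary(parts)
pin_dict, pin_tokens = build_dictionary(pins)
print(bits_needed(len(part_dict)), bits_needed(len(pin_dict)))
print([pin_dict[token] for token in pin_tokens] == pins)
```

Decoding is a plain subscript lookup into the dictionary, which is why fixed-width tokens preserve random access to the rows.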
Table 2. The CHIPSEC database contents and its compressed representation

3.1 Data Structure for Relations



In the conventional horizontal data organization, the tuples of a relation are stored
as records with fields adjacent to one another. In disk resident database systems the
horizontal organization is more popular. It has the advantage that a single disk access
fetches all fields of a tuple. Even for non-compressed RAM databases, the
implementational simplicity of this approach commends it. It is equally valid,
however, to think of the relation as a sequence of columns, and implement these
columns as vectors so that corresponding fields in successive tuples are stored
adjacently: a vertical data organization, as shown in Figure 1. A compressed data
representation imposes constraints which favour a vertical organization.

Fig. 1. A relation represented in a columnar format

Unlike in a conventional database system, the field widths are not known when the
schema is created. This is particularly the case when the data of the relation is being
extended by tuple-at-a-time insertion or update. The number of bits required to
represent the domain of a field is proportional to the logarithm of the domain's current
cardinality. As data is loaded the cardinality grows, and with it the number of bits
required to represent a token. It is necessary for the data structures which represent
the relation to be able to alter dynamically to accommodate varying tuple widths, and
in particular to do so incrementally with known overheads of limited and predictable
extent. For example, suppose we have a field (e.g. the Chipname field in our
examples) that requires 3 bits to encode its domain. If the number of unique strings
used in the domain rises above 8, we will need 4 bits to represent them in encoded
form. An extended description of Hibase can be found in [26].
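
The growth of the token width during tuple-at-a-time insertion can be sketched as follows. The class `TokenColumn` is hypothetical and only tracks the required width; how the stored column vector is physically widened is not modelled here.

```python
class TokenColumn:
    """A dictionary-coded column that reports when its token width must grow."""

    def __init__(self):
        self.dictionary = {}   # field value -> token
        self.tokens = []       # one token per tuple
        self.width = 1         # current token width in bits

    def append(self, value):
        """Insert one field value; return True if the token width had to grow."""
        token = self.dictionary.setdefault(value, len(self.dictionary))
        self.tokens.append(token)
        needed = max(1, (len(self.dictionary) - 1).bit_length())
        widened = needed > self.width
        self.width = max(self.width, needed)
        return widened

column = TokenColumn()
for n in range(8):                       # eight distinct (hypothetical) chip names
    column.append(f"chip{n}")
print(column.width)                      # width with exactly 8 distinct values
grew = column.append("chip8")            # a ninth distinct name arrives
print(grew, column.width)
```

Re-inserting a value that is already in the dictionary never changes the width, so the cost of widening is paid only when the domain cardinality crosses a power of two.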

3.2 Querying a Compressed Database 

An essential feature of our approach is that operations are applied directly to
datasets in their compressed format. Before a query is executed, it is translated into the
compressed domain, so the amount of data that has to be moved into the CPU and
processed is significantly reduced. The final answer has to be converted to the
uncompressed representation. The computational cost of this decompression is
borne only by the tuples that are returned in the result, which are normally a small fraction
of those processed. The relational representation also includes compressed indexes to
facilitate rapid access to required records. Dictionary and indexing techniques will not
be described in this paper, but are used in the Hibase prototype and are partly
responsible for its good performance.
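
A selection evaluated in the compressed domain can be sketched as below. This is a simplified illustration under our own assumptions: a linear scan over a token vector stands in for the compressed indexes, which are not described in this paper, and the function names and data are hypothetical.

```python
def select_rows(dictionary, tokens, constant):
    """Row numbers whose field equals `constant`, found by comparing tokens only."""
    if constant not in dictionary:
        return []                          # value absent from the domain: no scan needed
    token = dictionary.index(constant)     # translate the query constant once
    return [row for row, tok in enumerate(tokens) if tok == token]

def decode(columns, rows):
    """Decompress only the given rows; columns is a list of (dictionary, tokens)."""
    return [tuple(d[toks[r]] for d, toks in columns) for r in rows]

# Hypothetical two-column relation in dictionary-coded, columnar form.
chip_dict, chip_tokens = ["alpha", "beta", "gamma"], [0, 1, 2, 1]
pins_dict, pins_tokens = [8, 16], [0, 1, 1, 0]

hits = select_rows(pins_dict, pins_tokens, 16)
result = decode([(chip_dict, chip_tokens), (pins_dict, pins_tokens)], hits)
print(hits, result)
```

Only the rows in `hits` are ever decompressed, which mirrors the observation that the decoding cost is borne by the result tuples alone.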



4 Experimental Results

The following section compares the results obtained by the prototype Hibase 
system with those of other systems. Comparisons are given of the volumes of data 
storage required. Data compression depends on the natural occurrence frequencies of 
different elements of the data. Two data sets were used, the TPC-D benchmark dataset 
[37] and a real-life one derived from Telecommunications data. Performance compa- 
risons are given for standard benchmark operations on a standard international 
database dataset (the Wisconsin benchmark). In this case we have given timings for 
the same datasets and operations as have been used by other workers in the field. 



4.1 Storage Performance 

The TPC-D benchmark dataset was used to show the storage savings of the 
compressed system. The Lineitem table of the dataset was chosen, with scale factors 
0.0001, 0.001, 0.01, and 1, resulting in 600, 6000, 60000 and 6000000 tuples 
respectively. The table was restricted to ten dimensions prior to the tests. The 
dimensions were: Orderkey, Partkey, Suppkey, Linenumber, Returnflag, Linestatus, 
Shipdate, Commitdate, Receiptdate, Shipinstruct. Figure 2 shows the effectiveness of 
the compression. Table 3 gives a comparison of the volumes required for the four 
different relations derived from a Telecom data set. The datasets have mixed textual 
and numeric fields, each consisting of a single large relational table associating 
European Internet host and domain names with their IP addresses. The tables had 
eleven columns: one for the host name, four for the components of the domain names 
and six for the fields of the IP address. For each dataset, a primary index was 
constructed on the hostname along with a secondary composite index on the fields of 
the IP address. The overall compression factor is approximately 0.5. The Hibase 
prototype requires only approximately one eighth of the storage required by each 
conventional counterpart [24]. When the addressability of the individual tuples need 
not be preserved, a further reduction in data volume can be achieved by applying a 
secondary lossless compression algorithm prior to archiving and transmitting (Table 
3). These results indicate that the upper limit of compression is likely to be 
approximately twice that of the current prototype system. 

Fig. 2. Comparison in storage between Hibase and ASCII for the TPC-D dataset 
(scale factor 0.001) 



Flat File (MB)    HIBASE Compression (KB)    HIBASE + LZW Compression (KB) 
15                7775                       3686 
10                5305                       2178 
5                 2420                       1049 
1                 527                        227 

Table 3. First order compression (Hibase) and Second order compression (LZW). 
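
As a worked check of the figures in Table 3, the following sketch computes the compressed-to-flat-file ratios row by row. It assumes 1 MB = 1024 KB, which the paper does not state; the function name ratios is illustrative only.

```python
KB_PER_MB = 1024  # assumption: the paper does not say whether MB means 1000 or 1024 KB

# Rows of Table 3: (flat file in MB, Hibase in KB, Hibase + LZW in KB)
TABLE_3 = [
    (15, 7775, 3686),
    (10, 5305, 2178),
    (5, 2420, 1049),
    (1, 527, 227),
]


def ratios(flat_mb: float, hibase_kb: float, lzw_kb: float):
    """Return (first-order ratio, second-order ratio) relative to the flat file."""
    flat_kb = flat_mb * KB_PER_MB
    return hibase_kb / flat_kb, lzw_kb / flat_kb


for row in TABLE_3:
    first, second = ratios(*row)
    print(f"{row[0]:>3} MB flat: Hibase {first:.2f}, Hibase + LZW {second:.2f}")
```

Under this assumption the first-order ratios stay close to 0.5 and the second-order ratios fall between 0.2 and 0.25, which agrees with the statement that a secondary lossless pass roughly doubles the compression.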





System       No of Processors      10% selection (ms)    1% selection (ms) 
Hibase       1 x 66 MHz 486DX2     178                   14 
Starburst    1 x 25 MHz RS6000     169                   - 
Monet        4 x 150 MHz MIPS      900                   287 
Prisma       5 x 68020             -                     248 

Table 4. 10K Wisconsin Benchmark: 13 numeric and 3 string attributes. 



4.2 Processing Performance Comparison 

A number of groups have published performance results using the Wisconsin 
benchmarks [30]. These include work on parallel database architectures, and also 
approaches using conventional disk based systems in conjunction with high 
performance memory [31], [32], [33], [34], [35], [36], [21]. Table 4 shows the Hibase 
performance figures relative to the other approaches for a 10K tuple Wisconsin 
Database, comprising 13 numerical and 3 string attributes. 

Conclusion 

This paper demonstrates that appropriate data compression techniques have an 
important role to play in efficient storage models for the data warehouse 
architecture. Data compression provides a region of high performance, inaccessible to 
conventional disk-based databases and main-store-based databases, being faster than 
the former and less costly than the latter. The prototype system demonstrates good 
performance over a range of database functions. As many operations as possible are 
conducted using the compressed representation of the data directly, making 
processing and I/O operations faster than on the corresponding uncompressed data. The 
compressed representation described here is in fact beneficial throughout the storage 
hierarchy, from processor cache to archive medium. Our prototype is not a complete 
database management system. However, it is a demonstration of the validity of the 
design concept at the level of the basic relational operations. 

Abstract. To maintain data warehouse views without interfering with current databases, a Join 
Differential File (JDF) scheme is introduced. The scheme uses differential files derived from the relevant 
logs of the databases, and join differential files obtained by capturing the referential integrity signal 
between the base relations. Cost functions are formulated and used to analyze the performance of the JDF, 
the base method, and the pseudo differential method under various conditions. The algorithm is shown to 
be much better than the other two methods when the communication speed is high, tuples are heavily 
screened, and the join differential files are small. 



1 Introduction 

A data warehouse is a repository of integrated information, available for querying and analysis (i.e., 
DSS and data mining) [2], [10], [11]. The warehouse information, mainly derived from the base relations, 
can be stored as materialized views and can be aggregated or manipulated as multi-dimensional 
summaries. If some portions of the base relations are changed, the change should be applied to the views 
in order to guarantee the correctness of the data. This is known as the view maintenance problem or the 
materialized view problem, and it has been studied extensively [1], [3], [7], [15], [17]. 

Materialized view maintenance with join operations is known to be complex [13], because the 
join is one of the most time-consuming and data-intensive operations in relational database systems. In the 
data warehouse environment, join operations are very important to support multi-dimensional databases, 
and an efficient join algorithm is essential [2], [3]. 

Methods for supporting join materialized views can be classified into three kinds: the base table method 
(base), the pseudo differential method (pseudo), and the fully differential method (differential, here 
called JDF). The base method is the simplest. It re-executes the view definition, but it may incur 
unacceptable costs each time [4]. Similarly, joining after full replication or after versioning may be an 
alternative for the data warehouse environment. Several approaches such as [9], [11], and [15] have been 
suggested, but they are a kind of modified base table method and usually lock the current 
databases. In some cases, additional effort is needed to make those replications (or versions) consistent. 

Another, more efficient approach is represented by the Incremental Access Method [14], coherency 
indexes [6], and relevant logging [3]. It maintains only the tuple identifiers (TIDs) and defers the updates to 
the base table as well as to the (join) views. It, however, turns out to be a base table method at the point of 
updating the base tables and the relevant (join) views. So we can appropriately call it a pseudo differential 
method. 

In this paper we propose a join view update algorithm that works in differential mode. In suggesting a join 
view update scheme, a basic principle is kept in mind: in order to make the data warehouse system more 
efficient, interference with the current database should be minimized. 

The rest of the paper is organized as follows. Section 2 introduces the architecture of data warehouse 
views, including a motivating example. Section 3 addresses the join view updating scheme. The cost 
functions with their parameters and the performance analyses are presented in Sections 4 and 5 respectively. 
Section 6 concludes the paper. 

2 The Architecture for the Data Warehouse Views 



$R_i(TID_i, A_i, A_{fk})$ and $R_j(TID_j, A_j)$ are base relations, where the TIDs are tuple identifiers of the base 
relations. $A_r$ (for all r = i, j) is a set of attributes, and $A_{fk}$ is a foreign key of $R_i$ that refers to 
$TID_j$ of $R_j$. The data warehouse view is materialized and defined as $V(VID_v, A_v, C_v)$, where $VID_v$ is a view 
identifier, $C_v$ is a conditional predicate, and $A_v$ is the set of attributes of the view (for all $v \in i, j$). Let us 
suppose that we have a materialized view (V) defined as follows. 

Example 3.1 Employee information is collected in the EMP table, and the DEPT table holds the 
department name: EMP(e#, salary, dno, time) and DEPT(d#, dname, time). The view is defined as 
follows: 



CREATE MATERIALIZED VIEW V (no, salary, dno, department, time) 

AS SELECT E.e#, E.salary, E.dno, D.dname, T.time 

FROM EMP E, DEPT D, TIME T WHERE E.dno = D.d# AND T.time = E.time; 

Suppose that the contents of table EMP, DEPT, TIME and the view V (fact_table) are represented with an 
ER Diagram as follows. 




The base relations are changed by transactions. In this paper we represent the changed portion 
as a differential file (DF). The file can be derived from the active log of the base relation. We use the 
differential file schema and its algorithm as in [17]. The DF of a base relation ($R_i$) is defined as 
$dR_i(TID_i, A_i, \text{operation-type}, TS^d)$. The operation-type indicates the type of operation applied to the 
changed tuple. It has one of two codes: 'insert' or 'delete'. A modification is assumed to be a delete and an 
insert in series with the same time-stamp. $TS^d$ is the time-stamp at which the tuple is appended to the DF. (The 
superscript d represents the DF.) Without loss of generality, the time-stamp is assumed to be the same, so 
$TS^d = TS$. Then each record of changes is appended to the DFs (say, $dR_i$ and $dR_j$) respectively. 



Example 3.2 We suggest some changes to Example 3.1 as follows. In dE, the first tuple 
represents that a new employee 'e7' entered the company at time 10:20. We can see that the 
salary of employee 'e2' is raised from 1000 to 1500 and his/her department is changed at time 10:25. Then 
employee 'e1' is deleted and 'e8' is newly inserted. In table dD, we can also find that the name of the 
'Sales' department is changed to customer-service (CS). 
dE 

e#    Salary    dno    operation-type    TS 
e7    4500      2      insert            10:20 
e2    1000      3      delete            10:25 
e2    1500      2      insert            10:25 
e1    2500      2      delete            10:30 
e8    3000      4      insert            10:40 

dD 

d#    dname    operation-type    TS 
2     Sales    delete            10:35 
2     CS       insert            10:35 
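
The differential file dE of Example 3.2 can be written down as an append-only list of records. The sketch below is only an illustration of the schema (TID, attributes, operation-type, time-stamp); the class DFRecord and the three log_* helpers are hypothetical names, not part of the scheme in [17].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DFRecord:
    tid: str            # tuple identifier, e.g. e#
    attrs: Tuple        # remaining attribute values (salary, dno)
    op: str             # 'insert' or 'delete'
    ts: str             # time-stamp at which the record is appended


def log_insert(df: List[DFRecord], tid: str, attrs: Tuple, ts: str) -> None:
    df.append(DFRecord(tid, attrs, "insert", ts))


def log_delete(df: List[DFRecord], tid: str, attrs: Tuple, ts: str) -> None:
    df.append(DFRecord(tid, attrs, "delete", ts))


def log_modify(df: List[DFRecord], tid: str, old: Tuple, new: Tuple, ts: str) -> None:
    """A modification is a delete followed by an insert with the same time-stamp."""
    log_delete(df, tid, old, ts)
    log_insert(df, tid, new, ts)


dE: List[DFRecord] = []
log_insert(dE, "e7", (4500, 2), "10:20")
log_modify(dE, "e2", (1000, 3), (1500, 2), "10:25")
log_delete(dE, "e1", (2500, 2), "10:30")
log_insert(dE, "e8", (3000, 4), "10:40")

for record in dE:
    print(record)
```

The five records produced correspond row by row to the dE table above.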



3 The Updating Join View Scheme 



3.1 The Differential Join File 

In maintaining the join materialized views, we assume that joins are established within a referential 
integrity (RI) constraint. Of course there are many non-foreign-key joins, but joining by the foreign 
key* is the most frequent. Especially in the data warehouse environment, the join in a star schema or 
a snowflake schema is almost always established on foreign keys [2]. Without loss of generality, we can 
assume that the base relation ($R_i$) has a relationship with another relation (say, $R_j$) such that 
$A_{fk} \subseteq TID_j$. In this case $R_i$ is said to be a referencing relation, and $R_j$ a referenced relation. 

When a tuple is changed (i.e., inserted, deleted, or updated) in a base table, a referential integrity (RI) 
constraint can be fired to check the relevance of the change. Not all RI operations are considered here, 
only the insertions in the referencing relation, since the other changes, such as an insert in the referenced 
relation and deletes in both the referencing and the referenced relation, can be recorded in the DFs [11]. 

In this paper, in order to update join materialized views without locking base relations, a new file 
called a join differential file (JDF) is introduced. The JDF is derived from the tuples of the referenced 
relation that the RI constraint indicates. The schema of the JDF of table $R_j$ is defined as 
$jR_j(TID_j, A_j, TS^j)$. 



*) In establishing the referential integrity (RI) constraints, we assume that 'the safeness condition' [12] is satisfied 
regardless of the firing order of integrity constraints [8]. 

$TS^j$ is the time-stamp at which the tuple is appended to the JDF. The superscript j represents the JDF. 
Without loss of generality the time-stamp is assumed to be $TS^j = TS$. 

In maintaining the JDF, duplicated tuples should be eliminated. Various methods for screening 
duplicated tuples have been suggested, such as [1], [3], and [17]. In this paper, we adopt [17]. The algorithm 
AppendJDF shows how the JDF is appended. The tuple in the referenced relation indicated by the RI 
(due to an insert in the referencing relation) is appended to the JDF. The algorithm DeleteJDF represents 
the duplicate elimination procedure. Basically, the algorithm prevents the JDF from growing without 
bound. 



Algorithm AppendJDF 
Input: jRj, the RI trigger in dRi 
Output: Consistent jRj 
Method: 
[1] If there is an RI trigger in dRi, do 
[2] If the trigger is an insert, else [5] 
[3] Append the tuple that the trigger indicates in the jRj. 
[4] Do the DeleteJDF 
[5] End 


Algorithm DeleteJDF 
Input: jRj, dRi, dRj, and an input to jRj 
Output: Consistent jRj 
Method: 
[1] If there is an input in dRj or in jRj, else [6] 
[2] If there is the same TID in jRj with the input of jRj, else [4] 
[3] Then substitute the tuple in jRj by the input value 
[4] If there is the same TID in jRj with the input of dRj, else [6] 
[5] Then delete the tuple in jRj 
[6] End 



Example 3.3 In Example 3.2, in order to insert a tuple (say, e7) in EMP, the RI check for 
relevance is required. In this case, the check succeeds, for the foreign key dno = 2 exists in DEPT. 
Then we can get the tuple with dno = 2 from the base table DEPT and append it to the JDF (say, jD). By the 
second operation (e2), the tuple with dno = 2 in jD is substituted by the same tuple with the time-stamp 10:25. 
But at time 10:35 the duplicate elimination is activated by a modification in dD, thus the tuple with dno = 2 is 
deleted from jD. Then by the last operation in dE, dno = 4 is appended. Therefore jD is as follows: jD = {[4, Strategy, 10:40]}. 



3.2 Maintaining Join Materialized Views 

Using the two files DF and JDF introduced above, the data warehouse can maintain the join materialized 
views independently of the current database relations. In updating the views, it is necessary to determine 
which tuples need to refer to other relations. For the deleted tuples, it is sufficient to send the TID (with its 
TS) to the view. For inserted tuples, all the changed contents (i.e., $dR_i$ and $jR_j$) should be sent to the 
view. With these files, the view can be updated. This means that in maintaining the join materialized view 
by the JDF, there is no need to lock the base tables. 

Example 3.4 With the DFs (dE and dD) and the JDF (jD), the join view in Example 3.1 can be 
refreshed. The deleted tuple [e1, 2500, 2, delete, 10:30] need not refer to the base table D. The other 
tuples of dE and dD in Example 3.2 can be joined with the tuples of jD in Example 3.3. Then the join data 
warehouse view (V) can be updated without locking the base tables at time 11:00 as follows. 

no    Salary    Dno    department    Time 
e1    1000      2      CS            11:00 
e4    5000      3      R&D           11:00 
e5    2500      2      CS            11:00 
e6    3000      1      computer      11:00 
e7    4500      2      CS            11:00 
e8    3000      4      Strategy      11:00 



3.3 The Base Table Algorithms 

In comparing the algorithms, we consider Semi-join as a base table method (Algorithm Base) and a 
pseudo-differential method (Algorithm Pseudo). The algorithm Pseudo is assumed to utilize a set of tuple 
identifiers and adopts Semi-join in joining a view. The distribution of operation type among tuples is a 
random variable; it is assumed that the operation type distribution is the same as that of the differential 
tuples. The algorithms are addressed as follows. 



Algorithm Base 
Input: join view v, Ri, Rj 
Output: Consistent view v 
Method: 
[1] For Ri ≠ Null Do 
[2] Read all tuples in the Ri 
[3] Read the tuples of Rj that match the attributes of Ri 
[4] Join those tuples to the view v 
[5] Update the view v 


Algorithm Pseudo 
Input: join view v, dRi, Ri, dRj, Rj 
Output: Consistent view v 
Method: 
[1] For dRi or dRj ≠ Null Do 
[2] Read all tuples in dRi and dRj 
[3] Update Ri and Rj 
[4] Do Semi-join with Ri and Rj 
[5] Join those tuples and send them to the view v 
[6] Update the view v 



4 Parameters and Cost Functions 



4.1 Parameters 



$R_r$, $dR_r$, $jR_r$: Base table, DF, and JDF of $R_r$ respectively, for r = i, j. 

$dR_{r,D}$, $dR_{r,I}$: Sets of deleted and inserted tuples in the DF of $R_r$ respectively. 

$B$: Page size (bytes). 

$C_{io}$, $C_{com}$: I/O cost (ms/block) and transmission cost (bits/s). 

$\varphi[k, n, m]$: Cost of accessing k records in a file of n records stored in m pages [5]. 

$N(dR_i)$, $N(jR_j)$: Number of tuples of $dR_i$ per page ($= B/W_{R_i}$) and the number of tuples of the JDF respectively. 

$\alpha_s$, $SF$: Screen factor and semi-join factor respectively. 

$W_B$, $H_B$: Width and height of the B+-tree respectively, where $H_B = \log_{b_{av}} (\alpha_s \cdot N(R))$. 

$W_I$, $W_D$: Width (bytes) of each tuple with operation-type = insert and delete respectively. 

$W_{dR_i}$, $W_v$: Width of $dR_i$ and width of view v respectively. 

4.2 The Cost Function of JDF 

I/O cost functions and communication cost functions are denoted by NIO's and NCOM's respectively. In 
this paper we distinguish the inserted tuple from the deleted tuple, for the size of the deleted tuple is clearly 
smaller than that of the inserted tuple. It is sufficient for the deleted tuple to send the identifier and its 
time-stamp (instead of all the tuple contents) in maintaining join (data warehouse) views. Cost functions and 
their explanations are denoted as follows: 

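NIO1 = Cost of reading $dR_i$ and maintaining it 

$$NIO_1 = C_{io}\,[N(dR_{i,D})\,W_D + N(dR_{i,I})\,W_I]/B + 2\,C_{io}\,[(H_{dR_i} - 1) + \varphi[SF \cdot N(dR_i),\ SF \cdot N(dR_i)\,W_{dR_i}/B,\ \alpha_s \cdot SF \cdot N(dR_i)\,W_{dR_i}/B]]$$

NIO2 = Cost of reading $dR_j$ and maintaining it 

$$NIO_2 = C_{io}\,[N(dR_{j,D})\,W_D + N(dR_{j,I})\,W_I]/B + 2\,C_{io}\,[(H_{dR_j} - 1) + \varphi[N(dR_j),\ N(dR_j)\,W_{dR_j}/B,\ \alpha_s \cdot N(dR_j)\,W_{dR_j}/B]]$$

NIO3 = Cost of reading $jR_j$ and maintaining the JDF 

$$NIO_3 = C_{io} \cdot N(jR_j) \cdot W_{R_j}/B + 2\,C_{io}\,[(H_{jR_j} - 1) + \varphi[N(jR_j),\ N(jR_j)\,W_{R_j}/B,\ \alpha_s \cdot N(jR_j)\,W_{R_j}/B]]$$

NIO4 = Cost of accessing the B+-tree of the view index and reading the view table 

$$NIO_4 = 2\,C_{io}\,[(H_v - 1) + \varphi[N(V),\ N(V)\,W_v/B,\ \alpha_s \cdot N(V)\,W_v/B]]$$

NCOM1 = Cost of transmitting tuples ($dR_i$) to the view 

$$NCOM_1 \le 8 \cdot SF \cdot N(dR_i) \cdot W_{dR_i}/C_{com}$$

NCOM2 = Cost of sending joined tuples to the view 

$$NCOM_2 = 8 \cdot N(dR_j) \cdot W_{dR_j}/C_{com}$$

NCOM3 = Cost of sending $jR_j$ to the view 

$$NCOM_3 \le 8 \cdot N(jR_j) \cdot W_{R_j}/C_{com}$$

Then the total cost of the algorithm JDF is NIO1 + NIO2 + NIO3 + NIO4 + NCOM1 + NCOM2 + NCOM3. 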

5 Performance Analysis 

The following values are assigned to the parameters for the analysis. The block size is assumed 
to be $B$ = 4000 bytes, and the I/O cost $C_{io}$ = 25 ms/block. The tuple widths of the base tables and the 
differential files are assumed to be the same, $W_{R_i} = W_{R_j} = W_{dR_i} = W_{dR_j}$ = 200 bytes. The cardinalities of 
the base tables are assumed to be 1,000,000 and 500,000, and the size of each differential file is 10% of its 
base table in the experiment. The size of the inserted tuple is assumed to be the same as that of a base table 
tuple, i.e., $W_I$ = 200 bytes. In the deleted case, only the identifier (TID) is sent, so $W_D$ = 8 bytes. The 
communication speed is varied from a very low speed to a high speed, that is, $C_{com}$ = 100 Kbps to 10 Mbps. 
Screening is varied from the no-screening case ($\alpha_s$ = 1.0) to the highly screened case ($\alpha_s$ = 0.01). 
Three methods are analyzed: (1) the base table method (Base), (2) the pseudo-differential method (Pseudo), 
and (3) the differential method (JDF). 
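
To give a feeling for the orders of magnitude involved, the following sketch evaluates only the simplest terms of the cost model with the parameter values above: the transmission term 8 * N * W / C_com and a sequential read term C_io * N * W / B. It deliberately ignores the index access term, the screen factor and the semi-join factor, and it assumes 1 Kbps = 1000 bits/s; the function names are illustrative only.

```python
B = 4000                 # block size in bytes
C_IO_MS = 25             # I/O cost in ms per block
W_I, W_D = 200, 8        # width in bytes of an inserted and a deleted tuple


def transmit_seconds(n_tuples: int, width_bytes: int, bits_per_second: int) -> float:
    """Transmission term 8 * N * W / C_com, in seconds."""
    return 8 * n_tuples * width_bytes / bits_per_second


def sequential_read_ms(n_tuples: int, width_bytes: int) -> float:
    """Sequential read term C_io * N * W / B, in milliseconds."""
    return C_IO_MS * n_tuples * width_bytes / B


n_df = 100_000  # a differential file of 10% of a 1,000,000-tuple base table

for label, speed in (("100 Kbps", 100_000), ("10 Mbps", 10_000_000)):
    ins = transmit_seconds(n_df, W_I, speed)
    dele = transmit_seconds(n_df, W_D, speed)
    print(f"{label}: inserts {ins:.2f} s, deletes {dele:.2f} s")

print(f"sequential read of the DF: {sequential_read_ms(n_df, W_I) / 1000:.0f} s")
```

At 100 Kbps the transmission of full inserted tuples dominates the sequential read, while sending only identifiers for deleted tuples is 25 times cheaper; at 10 Mbps the I/O term dominates instead.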

Six figures in total are presented with respect to various criteria. Fig. 1 through Fig. 3 show that 
the total costs are all strongly dependent both on the screen factor and on the communication speed. Fig. 1 
differs from the others and indicates that the size of the file is the most critical factor. If the 
tuples are highly filtered (screen factor down to about 0.01), the base table method is less advantageous than 
the JDF or the Pseudo method. Fig. 3 shows that if tuples are screened little (screen factor up to about 
1.0), the JDF is less advantageous than the Pseudo method, and in some cases even than the Base method. 
This means that the differential method is not always preferable. In reality, however, the DF is likely to be 
smaller than the base table, for the DF represents only the changed part of the base table. 

Fig. 4 and 5 show how each cost varies with the screen factor. The Base method is 
relatively stable as the screen factor increases, whereas the cost of the Pseudo method increases rapidly and 
that of the JDF lies in between. As the communication speed increases, the JDF gradually approaches the 
Pseudo method, which means that the two methods are highly dependent upon the communication speed. Fig. 4 
shows that at a low communication speed (100 Kbps) the cost of the JDF is clearly lower than that of the other 
two methods. Fig. 4 and 5 show two facts: (1) If a base table is updated frequently, then 
view maintenance by the JDF or by the Pseudo method cannot be said to be significantly advantageous 
(regardless of the communication speed). (2) If the communication speed is low (under 1 Mbps), the JDF 
is clearly advantageous. 

Fig. 6 depicts the costs of the three methods with respect to the size of the DF. 
The cost of the Base method does not vary with the size of the DF, for it does not use the DF. The figure 
shows that the JDF scheme is preferable to the Base method if the DF is smaller than the base table. 

6 Conclusion 

In this paper the JDF scheme with its related algorithms is presented for a data warehouse environment. 
Utilizing differential files and a join differential file, the scheme is shown to be appropriate for maintaining 
(data warehouse) join views without accessing the current database relations. Three methods are 
analyzed: (1) the Base method, (2) the pseudo-differential method (Pseudo), and (3) a differential method 
(JDF). Cost functions and trial runs showed that the JDF is useful in maintaining data warehouse join 
views. In the experiment, if the tuples are heavily screened and the communication speed is under 1 Mbps, 
then the JDF is clearly advantageous. If a base table is updated frequently and tuples are screened little at a 
very high communication speed, then view maintenance by the JDF or by the Pseudo method is not 
significantly efficient. 




On the Independence of Data Warehouse from Databases in Maintaining Join Views 






Fig. 1. Cost traverse of JDF, Pseudo, and Base method with screening = 0.01 



Fig. 2. Cost traverse of JDF, Pseudo, and Base method with screening = 0.1 

Fig. 3. Cost traverse of JDF, Pseudo, and Base method with screening = 1.0 






W. Lee 



Fig. 4. Cost traverse of JDF, Pseudo, and Base method with communication speed = 100 Kbps (horizontal axis: screen factor) 

Fig. 5. Cost traverse of JDF, Pseudo, and Base method with communication speed = 10 Mbps 




Fig. 6. Cost traverse of JDF, Pseudo, and Base method with respect to the size of the differential file 

Abstract. A Data Warehouse (DW) can be abstractly seen as a set of 
materialized views defined over relations that are stored in distributed 
heterogeneous databases. The selection of views for materialization in a 
DW is thus an important decision problem. The objective is the 
minimization of the combination of the query evaluation and view 
maintenance costs. In this paper we expand on our previous work by proposing 
new heuristic algorithms for the DW design problem. These algorithms 
are described in terms of a state space search problem, and are 
guaranteed to deliver an optimal solution by expanding only a small fraction of 
the states produced by the (original) exhaustive algorithm. 



1 Introduction 

A Data Warehouse (DW) can be seen as a set of materialized views defined 
over distributed heterogeneous databases. All the queries posed to the DW are 
evaluated locally using exclusively the data that are stored in the views. The 
materialized views have also to be refreshed when changes occur to the data of 
the sources. The operational cost of a Data Warehouse depends on the cost of 
these two basic operations: query answering and refreshing. The careful selection 
of the views to be maintained in the DW may reduce this cost dramatically. For 
a given set of different source databases and a given set of queries that the 
DW has to service, there is a number of alternative sets of materialized views 
that the administrator can choose to maintain. Each of these sets has a different 
refreshment and query answering cost, while some of them may require more disk 
space than is available in the DW. The Data Warehouse design problem is the 
selection of the set of materialized views with the minimum overall cost that fits 
into the available space. 

Earlier work [8] studies the DW design and provides methods that generate 
the view selections from the input queries. It models the problem as a state space 
search problem, and designs algorithms for solving the problem in the case of 
SPJ relational queries and views. 

* Research supported by the European Commission under the ESPRIT Program LTR 
project "DWQ: Foundations of Data Warehouse Quality" 

1.1 Related Work 

Many authors in different contexts have addressed the view selection problem. H. 
Gupta and I.S. Mumick in [2] use an A* algorithm to select the set of views that 
minimizes the total query-response time and also keeps the total maintenance 
time less than a certain value. A greedy heuristic is also presented in this work. 
Both algorithms are based on the theoretical framework developed in [1] using 
AND/OR view directed acyclic graphs. In [3] a similar problem is considered 
for selection-join views with indexes. An A* algorithm is also provided as well 
as rules of thumb, under a number of simplifying assumptions. In [10], Yang, 
Karlapalem and Li propose heuristic approaches that provide a feasible solution 
based on merging individual optimal query plans. In a context where views are 
sets of pointer arrays, Roussopoulos also provides in [7] an A* algorithm that 
optimizes the query evaluation and view maintenance cost. 



1.2 Contribution and Paper Outline 

In this paper we study heuristic algorithms for the DW design problem. Based 
on the model introduced in [8, 9] we introduce a new A* algorithm that delivers 
the optimal design. This algorithm prunes the state space and provides the 
optimal solution by expanding only a small fraction of the whole state space. 
We also present two variations of the heuristic function used in A*, a ‘static’ 
and a ‘dynamic’ heuristic function. The dynamic heuristic function is able to 
do further pruning of the state space. To demonstrate the superiority of the A* 
algorithm, we compare it analytically and experimentally with the algorithms 
introduced in [8]. 

The rest of the paper is organized as follows. In Section 2 we formally define 
the DW design problem as a state space search problem providing also the cost 
formulas. In Section 3 we propose a new A* algorithm that delivers an optimal 
solution for the DW design problem. Improvements to the A* algorithm are 
proposed in Section 4. Section 5 presents experimental results. We summarize 
in Section 6. 

2 The DW design problem 

We consider a nonempty set of queries $\mathbf{Q}$, defined over a set of source relations 
$\mathbf{R}$. The DW contains a set of materialized views $\mathbf{V}$ over $\mathbf{R}$ such that every query 
in $\mathbf{Q}$ can be rewritten completely over $\mathbf{V}$ [4]. Thus, all the queries in $\mathbf{Q}$ can be 
answered locally at the DW, without accessing the source relations in $\mathbf{R}$. By 
$Q^{\mathbf{V}}$, we denote a complete rewriting of the query $Q$ in $\mathbf{Q}$ over $\mathbf{V}$. 

Consider a DW configuration $C = \langle \mathbf{V}, \mathbf{Q}^{\mathbf{V}} \rangle$ [8, 9]. We define: 

$E(\mathbf{Q}^{\mathbf{V}})$ : The sum of the evaluation cost of each query rewriting $Q_i^{\mathbf{V}}$ in $\mathbf{Q}^{\mathbf{V}}$ 
multiplied by the frequency of the associated input query $Q_i$, 

$M(\mathbf{V})$ : The sum of the view maintenance cost of each view in $\mathbf{V}$, 

$S(\mathbf{V})$ : The sum of the space needed for all views in $\mathbf{V}$, 

$T(C)$ : The operational cost of $C$ where: 

$$T(C) = c\,E(\mathbf{Q}^{\mathbf{V}}) + M(\mathbf{V})$$

The parameter c indicates the relative importance of the query evaluation cost 
and view maintenance cost. 

The DW design problem can then be stated as follows [8]: 

Input: A set of source relations $\mathbf{R}$. A set of queries $\mathbf{Q}$ over $\mathbf{R}$. The cost functions 
E, M, T. The space t available in the DW for storing the views. A 
parameter c. 

Output: A DW configuration $C = \langle \mathbf{V}, \mathbf{Q}^{\mathbf{V}} \rangle$ such that $S(\mathbf{V}) \le t$ and $T(C)$ is 
minimal. 
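
The problem statement can be read as a constrained minimization over candidate configurations. The sketch below assumes that the three aggregate quantities E, M and S are already known for each candidate; the class Config, the candidate names and all cost figures are hypothetical and serve only to illustrate the objective and the space constraint.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    name: str
    eval_cost: float    # E(Q^V): frequency-weighted query evaluation cost
    maint_cost: float   # M(V): total view maintenance cost
    space: float        # S(V): total space of the materialized views


def operational_cost(cfg: Config, c: float) -> float:
    """T(C) = c * E(Q^V) + M(V)."""
    return c * cfg.eval_cost + cfg.maint_cost


def best_config(candidates: Iterable[Config], c: float, t: float) -> Optional[Config]:
    """Return the feasible candidate (S(V) <= t) with minimal T(C), or None."""
    feasible = [cfg for cfg in candidates if cfg.space <= t]
    return min(feasible, key=lambda cfg: operational_cost(cfg, c), default=None)


# Hypothetical candidates with made-up costs.
candidates = [
    Config("materialize all queries", eval_cost=10, maint_cost=90, space=120),
    Config("shared views", eval_cost=25, maint_cost=40, space=70),
    Config("base relations only", eval_cost=80, maint_cost=20, space=50),
]

best = best_config(candidates, c=1.0, t=100)
print(best.name if best else "no feasible configuration")
```

Lowering c shifts the weight toward maintenance cost, so a different candidate may become optimal under the same space limit.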

In this paper we investigate the DW design problem in the case of selection- 
projection-join conjunctive queries without self-joins. The relation attributes 
take their values from domains of integer values. Atomic formulas are of the 
form x op y + c or x op c, where x, y are attribute variables, c is a constant, and 
op is one of the comparison operators =, <, >, ≤, ≥ but not ≠. A formula F 
implies a formula F' if both involve the same attributes and F is more restrictive. 
(For example, A = B implies A ≤ B + 10.) Atoms involving attributes from only 
one relation are called selection atoms, while those involving attributes from two 
relations are called join atoms. 
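
Implication between two atoms over the same attributes can be checked by comparing the integer ranges they allow for the difference x - y. The sketch below is one possible way to do this under the assumptions stated above (integer domains, no ≠); the tuple layout of an atom and the function names are hypothetical, and the sketch does not normalize atoms that mention the same attributes in the opposite order.

```python
import math
from typing import Optional, Tuple

# (x, op, y, c) stands for "x op y + c"; y is None for the form "x op c".
Atom = Tuple[str, str, Optional[str], int]


def interval(op: str, c: int) -> Tuple[float, float]:
    """Closed integer range allowed for the difference x - y."""
    if op == "=":
        return (c, c)
    if op == "<":
        return (-math.inf, c - 1)
    if op == "<=":
        return (-math.inf, c)
    if op == ">":
        return (c + 1, math.inf)
    if op == ">=":
        return (c, math.inf)
    raise ValueError(f"unsupported operator: {op}")


def implies(f: Atom, g: Atom) -> bool:
    """F implies G if both use the same attributes and F's range lies inside G's."""
    if (f[0], f[2]) != (g[0], g[2]):
        return False
    lo_f, hi_f = interval(f[1], f[3])
    lo_g, hi_g = interval(g[1], g[3])
    return lo_g <= lo_f and hi_f <= hi_g


print(implies(("A", "=", "B", 0), ("A", "<=", "B", 10)))
```

The call at the end encodes the example from the text, A = B against A ≤ B + 10.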

2.1 Multiquery graphs 

A set of views $\mathbf{V}$ can be represented by a multiquery graph. A multiquery graph 
allows the compact representation of multiple views. For a set of views $\mathbf{V}$, the 
corresponding multiquery graph, $G^{\mathbf{V}}$, is a node and edge labeled multigraph. 
The nodes of the graph correspond to the base relations of the views. The label 
of a node $R_i$ in $G^{\mathbf{V}}$ is the set containing the attributes of the corresponding 
relation that are projected in each view of $\mathbf{V}$. For every selection atom p of the 
definition of a view V, involving attributes of $R_i$, there is a loop on $R_i$ in $G^{\mathbf{V}}$ 
labeled as V : p. For every join atom p of the definition of a view V, involving 
attributes of $R_i$ and $R_j$, there is an edge between $R_i$ and $R_j$ in $G^{\mathbf{V}}$ labeled as 
V : p. The complete definition of the multiple query graph appears in [8]. 
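
A multiquery graph as described here can be held in a very small data structure. The sketch below is only an illustration of the labeling scheme; the class MultiqueryGraph, its method names, and the two sample views over relations S and T are hypothetical and do not reproduce the full definition in [8].

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


class MultiqueryGraph:
    """Node and edge labeled multigraph for a set of views (illustrative only)."""

    def __init__(self) -> None:
        # relation -> {view -> attributes of that relation projected in the view}
        self.node_labels: Dict[str, Dict[str, Set[str]]] = defaultdict(dict)
        # (relation, relation, "V : p"); a loop has the same relation twice
        self.edges: List[Tuple[str, str, str]] = []

    def add_projection(self, view: str, relation: str, attrs: Iterable[str]) -> None:
        self.node_labels[relation].setdefault(view, set()).update(attrs)

    def add_atom(self, view: str, atom: str, rel_a: str,
                 rel_b: Optional[str] = None) -> None:
        """A selection atom becomes a loop, a join atom an edge between two nodes."""
        other = rel_a if rel_b is None else rel_b
        for rel in (rel_a, other):
            self.node_labels[rel].setdefault(view, set())
        self.edges.append((rel_a, other, f"{view} : {atom}"))

    def loops(self, relation: str) -> List[str]:
        return [label for a, b, label in self.edges if a == b == relation]


# Two hypothetical views over relations S and T.
g = MultiqueryGraph()
g.add_projection("V1", "S", {"B"})
g.add_projection("V1", "T", {"C"})
g.add_atom("V1", "S.B < 10", "S")
g.add_atom("V1", "S.A < T.K", "S", "T")
g.add_projection("V2", "S", {"A"})
g.add_atom("V2", "S.B < 5", "S")

print(g.loops("S"))
```

Because edges are kept in a list, several views can contribute parallel edges or loops on the same nodes, which is what makes the structure a multigraph.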



2.2 Transformation Rules 

In [8] we defined the following five transformation rules that can be applied to 
a DW configuration. 

Edge Removal: A new configuration is produced by eliminating an edge 
labeled by the atom p from the query graph of view V, and adding an 
associated condition to the queries that are defined over V. 

Attribute Removal: If there are atoms of the form A = B and A, B are 
attributes of a view V, we eliminate A from the projected attributes of V. 

View break: Let V be a view and $N_1$, $N_2$ two sets of nodes labeled by V in 
$G^{\mathbf{V}}$ such that: (a) $N_1 \not\subseteq N_2$, $N_2 \not\subseteq N_1$, (b) $N_1 \cup N_2$ is the set of all the nodes 
labeled by V in $G^{\mathbf{V}}$, and (c) there is no edge labeled by V between the nodes in 
$N_1 - N_2$ and $N_2 - N_1$. In this case this rule replaces V by two views, $V_1$ defined 
over $N_1$ and $V_2$ defined over $N_2$. All the queries defined over V are modified to 
be defined over $V_1 \bowtie V_2$. 

View Merging: A merging of two views $V_1$ and $V_2$ can take place if every 
condition of $V_1$ ($V_2$) implies or is implied by a condition of $V_2$ ($V_1$). In the new 
configuration, $V_1$ and $V_2$ are replaced by a view V which is defined over the same 
source relations and comprises all the implied predicates. All the queries defined 
over $V_1$ or $V_2$ are modified appropriately in order to be defined over V. 

Attribute Transfer: If there are atoms of the form A = c, where A is 
an attribute of a view V, we eliminate A from the projected attributes of V. All 
the queries defined over V are modified appropriately. 



2.3 The DW design problem as a state-space search problem 



The DW design problem is formulated as a state space search problem. A state s 
is a DW configuration $C = \langle \mathbf{V}, \mathbf{Q}^{\mathbf{V}} \rangle$. In particular, the state $C = \langle G^{\mathbf{Q}}, \mathbf{Q}^{\mathbf{Q}} \rangle$ 
that represents the complete materialization of the input queries is called the 
initial state $s_0$. There is a transition T(s, s') from state s to state s' iff s' can be 
obtained by applying any of the five transformation rules to s. It can be shown 
that by the application of the above transformation rules we can get all possible 
DW configurations [8]. With every state s we associate, through the function 
T(C), the operational cost of C. Also, the space needed for materializing the 
views in $\mathbf{V}$ is given by $S(\mathbf{V})$. We can solve the DW design problem by examining 
all the states that are produced iteratively from $s_0$ and reporting the one with the 
minimum value for the function T(C) that satisfies the constraint $S(\mathbf{V}) \le t$. 

It is evident that the number of states in the state space is too large. An
algorithm that solves the DW design problem by searching the state space within
an acceptable time has to prune the state space and examine only a limited
fraction of the states. However, given a transition $T(s, s')$, the operational
cost $T(C')$ and the space $S(\mathbf{V}')$ of $s'$ may be greater than, equal
to, or less than the corresponding $T(C)$ and $S(\mathbf{V})$ of $s$. Hence any
algorithm that wishes to guarantee the optimality of the solution it delivers
needs to examine every feasible state of the state space.

In order to provide an algorithm that will be able to deliver an optimal 
solution for the DW design problem but at the same time will prune down the 
size of the state-space, we proceed as follows to alter the way states are created. 

Consider the states $s_1, \ldots, s_n$, where $s_i$ is the initial state for the
query set $\mathbf{Q}_i = \{Q_i\}$, $i = 1, \ldots, n$, that is, the complete
materialization of the single input query $Q_i$. Let
$S_i = \{s_i^1, \ldots, s_i^{k_i}\}$ denote the set of all the feasible states
created from state $s_i$ by applying to it the transformation rules edge
removal, attribute removal, view break and attribute transfer. It is not hard to
see that view merging cannot be applied to $s_i$ or to any of the other states
produced from $s_i$.

Fig. 1. The states of the sets $S_1$ (a) and $S_2$ (b)

Example 1. Consider the queries Q\ = ttbc{(^b<w{S) ^a<k T), and <32 = 
7Tyi(o'B<5(5')). Figure 1(a) shows the elements of the set $S_1$ that we can get
from $Q_1$, while Figure 1(b) shows the elements of $S_2$ that we can get from $Q_2$.

Consider now two configurations $\langle \mathbf{V}_1, \mathbf{Q}_1^{\mathbf{V}_1} \rangle$
and $\langle \mathbf{V}_2, \mathbf{Q}_2^{\mathbf{V}_2} \rangle$. By combining
these configurations we can create a new configuration
$\langle \mathbf{V}, \mathbf{Q}^{\mathbf{V}} \rangle$ as follows:
$\mathbf{V} = \mathbf{V}_1 \cup \mathbf{V}_2$,
$\mathbf{Q}^{\mathbf{V}} = \mathbf{Q}_1^{\mathbf{V}_1} \cup \mathbf{Q}_2^{\mathbf{V}_2}$.
The nodes of the multiquery graph of the new configuration are the nodes of the
union of the original configurations. For each edge of the two original
multiquery graphs, an identical edge is added to the multiquery graph of the new
configuration. The same happens for each node label. The new multiquery graph
expresses collectively all the views and the query rewritings of the two
original configurations.
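The combination step can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Configuration` class and the use of plain string sets for views and query rewritings are hypothetical simplifications introduced here, whereas the paper operates on multiquery graphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """A DW configuration: a set of views plus a set of query rewritings.

    Both components are modelled as frozensets of strings purely for
    illustration (an assumption of this sketch).
    """
    views: frozenset
    rewritings: frozenset


def combine(c1: Configuration, c2: Configuration) -> Configuration:
    """Combine two configurations by taking the union of each component."""
    return Configuration(c1.views | c2.views, c1.rewritings | c2.rewritings)


c1 = Configuration(frozenset({"V1"}), frozenset({"Q1 over V1"}))
c2 = Configuration(frozenset({"V2"}), frozenset({"Q2 over V2"}))
combined = combine(c1, c2)
```

Because both components are plain unions, the operation is commutative and associative, so the order in which states are combined along a path of the tree does not affect the resulting configuration in this simplified model.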



Fig. 2. The tree of the combined states (levels labeled Depth 0, Depth 1,
Depth 2, Depth 3, ..., Depth n)



Combinations between states are defined similarly to combinations between
configurations. By combining states we can create the tree of combined states of
Figure 2. This tree is defined as follows: The nodes of the tree are states and
combined states. The root node is a state with no views or query rewritings
($\langle \emptyset, \emptyset \rangle$). The children of the root node are the
states $s_1^1, \ldots, s_1^{k_1}$. These nodes are at depth 1. The children of a
node $s$ which are expanded at depth $d$, $d > 1$, are the combined states
resulting from combining $s$ with each one of the states
$s_d^1, \ldots, s_d^{k_d}$, plus all the states resulting from applying the view
merging transformation rule to these combined states (not necessarily once). The
leaves of the tree (nodes at depth $n$) are the states of the state space of the
DW design problem as this was formulated earlier. The tree of the combined
states of Example 1 is shown in Figure 3.




The operational cost function $T$ and the space function $S$ can be defined at
each node of the tree of combined states. At each node, $T$ and $S$ express the
operational cost and the space of the corresponding configuration. By induction
we can prove that given a node $n_1$ at depth $d_1$ and a node $n_2$ at depth
$d_2$, where $d_1 < d_2$ and $n_1$ is an ancestor of $n_2$, then
$T(n_1) < T(n_2)$ and $S(n_1) < S(n_2)$. This is true under the assumption that
we consider no multiquery optimization and every view is maintained separately
without using auxiliary views [9]. The fact that the cost and the space
functions increase monotonically while we visit the nodes of the tree from the
root to the leaves allows the design of algorithms that find the optimal DW
configuration by exploring only a small fraction of the state space. The
following Branch and Bound algorithm is such an algorithm.
Branch and Bound algorithm: The algorithm generates and examines the
tree in a depth-first manner. Initially it sets $c = \infty$. When it finds a
leaf node state $s$ that satisfies the space constraint and has cost
$T(s) < c$, it keeps $s$ as $s_{opt}$ and sets $c = T(s)$. The generation of the
tree is discontinued below a node if this node does not satisfy the space
constraint or its cost exceeds $c$. When no more nodes can be generated it
returns $s_{opt}$ as the optimal DW configuration.
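The depth-first search with pruning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each set $S_i$ is given as a list of hypothetical (cost, space) pairs and that combining states simply adds costs and spaces, so view merging is ignored.

```python
import math


def branch_and_bound(levels, space_limit):
    """Depth-first branch and bound over the tree of combined states.

    levels[i] lists the (cost, space) pairs of the states in S_{i+1}.
    Assumption of this sketch: combining states adds their costs and
    spaces (no view merging), so both grow monotonically with depth.
    """
    best_cost, best_choice = math.inf, None

    def expand(depth, cost, space, choice):
        nonlocal best_cost, best_choice
        # Prune when the space constraint S < t fails or the cost exceeds c.
        if space >= space_limit or cost > best_cost:
            return
        if depth == len(levels):
            # Leaf node: keep it only if it is strictly cheaper than c.
            if cost < best_cost:
                best_cost, best_choice = cost, tuple(choice)
            return
        for index, (c, s) in enumerate(levels[depth]):
            expand(depth + 1, cost + c, space + s, choice + [index])

    expand(0, 0, 0, [])
    return best_cost, best_choice


levels = [[(10, 5), (6, 8)], [(8, 4), (3, 9)]]
result_tight = branch_and_bound(levels, space_limit=15)
result_loose = branch_and_bound(levels, space_limit=20)
```

With the tight limit the cheapest combination (second state of each level) needs 17 units of space and is pruned, so the search settles for a more expensive but feasible leaf; relaxing the limit makes the cheapest leaf admissible.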
3 A* Algorithm

We present an A* algorithm [6] that searches for the optimal solution in the
tree of the combined states. The new algorithm prunes the expanded tree more
effectively than the Branch and Bound algorithm because at each step it also
uses an estimate of the cost of the remaining nodes. The A* algorithm
introduces two functions, $g(s)$ and $h(s)$, on states. The value of $g(s)$
expresses the cost of the state $s$ and is defined as the total operational cost
of the associated configuration $C$ ($g(s) = T(C)$). The value of $h(s)$
expresses an estimate of the additional cost that will be incurred in getting
from $s$ to a final state. $h(s)$ is admissible if it always returns a value
lower than the actual cost to a final state. If $h(s)$ is admissible then the A*
algorithm that searches the tree of the combined states is guaranteed to find a
final state (leaf node) $s_{opt}$ such that the operational cost of $s_{opt}$ is
minimal among all final states [6]. In order to define $h(s)$, we introduce the
function $l(s_i^j)$ for each $s_i^j \in S_i$, the set of feasible states created
from $s_i$. This function expresses a lower bound of the estimated cost that
will be added to a combined state $s'$, in case $s'$ is produced by the
combination of $s_i^j$ with a third state $s$. The value of the function
$l(s_i^j)$ is the operational cost of the associated configuration
$\langle \mathbf{V}_i^j, \mathbf{Q}_i^{\mathbf{V}_i^j} \rangle$ minus the view
maintenance cost of the views that may contribute to a view merging:

$$l(s_i^j) = T(\langle \mathbf{V}_i^j, \mathbf{Q}_i^{\mathbf{V}_i^j} \rangle) - \sum_k M(V_k^j), \quad V_k^j \text{ contributes to a view merging}$$

We can also define $L(S_i)$ as the minimum $l(s_i^j)$ over all $s_i^j \in S_i$:

$$L(S_i) = \min_j [l(s_i^j)], \quad s_i^j \in S_i$$

For a state $s$ at depth $i$ the heuristic function $h(s)$ is defined as:

$$h(s) = \sum_{j=i+1}^{n} L(S_j)$$

Proposition 1. For every leaf node $s_l$ which is a successor of the node $s$,
$T(s_l) > g(s) + h(s)$ holds.

The proof of Proposition 1 is presented in [5].

A* Algorithm: The A* algorithm proceeds as follows: First it initializes
$c = \infty$, constructs $S_1, \ldots, S_n$, and begins the tree traversal from
the root node. When the algorithm visits a node it expands all its children. It
computes the function $g(s) + h(s)$ for each one of the generated nodes and also
the space function $S(s)$. Then it continues to generate the tree starting from
the state which has the lowest cost $g(s) + h(s)$. The generation of the tree is
discontinued below a node if this node does not satisfy the space constraint or
when $g(s) + h(s)$ exceeds $c$. When the algorithm finds a leaf node state $s_l$
that satisfies the space constraint and has cost $T(s_l) < c$, it keeps $s_l$ as
$s_{opt}$ and sets $c = T(s_l)$. When no more nodes can be generated, the
algorithm returns $s_{opt}$ as the optimal DW configuration.
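A best-first search guided by $g(s) + h(s)$ can be sketched as follows. As before, this is a simplified illustration rather than the authors' implementation: each $S_i$ is assumed to be a list of hypothetical (cost, space) pairs, combining states adds costs and spaces, and view merging is ignored, so $l(s)$ reduces to the cost of $s$ and $L(S_j)$ to the cheapest cost at level $j$.

```python
import heapq
import math


def a_star(levels, space_limit):
    """Best-first search on the tree of combined states using g(s) + h(s).

    Returns (best cost, chosen state index per level, nodes expanded).
    Assumption of this sketch: no view merging, additive costs and spaces.
    """
    n = len(levels)
    # h_at[d] is the sum of L(S_j) over the levels still to be combined.
    h_at = [0] * (n + 1)
    for d in range(n - 1, -1, -1):
        h_at[d] = h_at[d + 1] + min(cost for cost, _ in levels[d])

    best_cost, best_choice, expanded = math.inf, None, 0
    frontier = [(h_at[0], 0, 0, 0, ())]  # (g + h, depth, g, space, choice)
    while frontier:
        f, depth, g, space, choice = heapq.heappop(frontier)
        if f > best_cost:
            continue  # g + h exceeds c: nothing below can be better
        if depth == n:
            if g < best_cost:
                best_cost, best_choice = g, choice
            continue
        expanded += 1
        for index, (c, s) in enumerate(levels[depth]):
            g2, space2 = g + c, space + s
            f2 = g2 + h_at[depth + 1]
            if space2 < space_limit and f2 <= best_cost:
                heapq.heappush(
                    frontier, (f2, depth + 1, g2, space2, choice + (index,)))
    return best_cost, best_choice, expanded


levels = [[(10, 5), (6, 8)], [(8, 4), (3, 9)]]
tight = a_star(levels, space_limit=15)
loose = a_star(levels, space_limit=20)
```

In the loose case the first leaf popped from the queue is already the cheapest one, so the other depth-1 node is never expanded; the estimate $h$ is what lets the search skip it.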
4 Improvements to the basic A* Algorithm



Consider two admissible heuristic functions $h_1$ and $h_2$. $h_1$ is said to be
more informed than $h_2$ if for every non-final state $s$, $h_1(s) > h_2(s)$.
When an A* algorithm is run using $h_2$, it is guaranteed to expand at least as
many nodes as it does with $h_1$ [6]. The definition of $h$ uses the functions
$L(S_i)$ and $l(s_i^j)$ to pre-compute the additional cost from a state $s$ to a
final state. At each step, $l(s_i^j)$ excludes from the estimate the maintenance
cost of each view that may participate in a view merging, without considering
the maintenance cost of the new view that will replace the merged views.
Moreover, in some cases the maintenance cost of a view is eliminated even if the
merging of this view will generate no successor node. In order to get a more
informed heuristic function, we define a new “dynamic” heuristic function $h'$.
The new heuristic function uses the functions $L'$ and $l'$, which are “dynamic”
versions of the functions $L$ and $l$. The functions $L'$ and $l'$ are called
“dynamic” because they are recomputed at each iteration of the algorithm. The
main advantage of these functions is that they are able to exploit information
from the states already expanded, making the new heuristic function $h'$ more
informed than $h$. For each state $s$ at depth $d$ and for each
$s_i^j \in S_i$ where $i > d$, the function $l'(s_i^j, s)$ is defined as follows:



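$$l'(s_i^j, s) = T(\langle \mathbf{V}_i^j, \mathbf{Q}_i^{\mathbf{V}_i^j} \rangle) - \sum_k M(V_k^j) + \sum_k W(V_k^j, s)$$

$$W(V_k^j, s) = \begin{cases} 0 & \text{if } \exists V \in s \text{ such that } V \text{ can be merged with } V_k^j \\ \dfrac{M(V_k^j)}{n_k + 1} & \text{otherwise} \end{cases}$$

where $V_k^j$ may be contributing to a view merging and $n_k$ is the number of
views that can be merged with $V_k^j$ and that are in any of
$S_{d+1}, \ldots, S_n$.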

The function $L'(S_i, s)$, similarly to the function $L(S_i)$, is defined as the
minimum $l'(s_i^j, s)$ over all $s_i^j \in S_i$. The heuristic function $h'$ for
a state $s$ at depth $i$ is likewise defined as the sum of the function
$L'(S_j, s)$ for each $j \in \{i+1, \ldots, n\}$.

In [5] we prove that the heuristic function $h'$ is admissible. Obviously
$h'(s) > h(s)$ for every state $s$ of the tree of the combined states, so $h'$
is more informed than $h$. This means that when the A* algorithm uses $h'$,
fewer nodes are expanded than when $h$ is used.



5 Experimental Results 

We have performed a sequence of experiments to compare the performance of
the Branch and Bound algorithm and the A* algorithm, the latter using the
two heuristic functions presented in the previous sections. The algorithms are
compared in terms of the following factors: (a) the complexity of the input query
set and (b) physical factors. The complexity of the query set is expressed by
three parameters: the number of input queries, the number of selection and join
edges of all the input queries, and the overlapping of the queries in the input
query set. The query overlapping is expressed by the total number of implications
between the selection or join atoms of different input queries. Finally, the
physical factors are expressed by the relative importance of the query evaluation
and view maintenance cost (factor c in Section 2).

Fig. 4. Experimental Results: (a) CPU time vs size of input, (b) CPU time vs
number of edges, (c) CPU time vs number of implications, (d) CPU time vs
physical characteristics

Experiment 1: The number of queries varies. We study the performance of
the algorithms when the number of the input queries varies. Figure 4(a) shows
the CPU time needed by each algorithm. The performance of the A* algorithm
using the dynamic heuristic function h' is denoted by A* Impr. This algorithm
performs significantly better than the other two.

Experiment 2: The number of edges varies. We study the performance of
the algorithms as the number of selection and join edges of the input queries
varies. Figure 4(b) shows the CPU time that the Branch and Bound algorithm and
the two variations of A* take. In this experiment too, the improved A*
algorithm is the fastest.

Experiment 3: The number of implications varies. We study the performance
of the algorithms while varying the number of implications between atoms
of different queries. Figure 4(c) shows the CPU time taken by the algorithms.
As the number of implications grows and before it exceeds a certain limit, both











the Branch and Bound and the A* algorithms take longer to execute. When
the number of implications exceeds this limit, the algorithms become faster.

Experiment 4: The parameter c varies. We run the algorithms while varying the
parameter c. Figure 4(d) reports the CPU time needed by the algorithms. When
c is close to 0 or much greater than 1 (that is, when either the view
maintenance cost or the query evaluation cost is the important factor), all the
algorithms perform very efficiently. In the middle interval the algorithms are
slower.

6 Summary 

In this paper we have studied heuristic algorithms that solve the DW design 
problem, by extending the work presented in [8]. We have studied the DW design 
problem as a state space search problem and proposed a new A* algorithm that 
guarantees to deliver an optimal solution by expanding only a small fraction of 
the states produced by the (original) exhaustive algorithm and the Branch and 
Bound algorithm proposed in [8] . We have also studied analytically the behaviour 
of the A* algorithm and proposed a new improved heuristic function. Finally we
implemented all the algorithms and investigated their performance with respect
to the time required to find a solution.

Interesting extensions of the present work include the following: (a) The use 
of auxiliary views in the maintenance process of the other views, and (b) The 
enlargement of the class of queries to include aggregate queries. 
Abstract. We propose the Posse^ framework for optimizing incremental
view maintenance at data warehouses. To this end, we show how, for
a particular method of consistent probing, it is possible to have the power
of SQL view queries with multiset semantics, and at the same time have
available a spectrum of concurrency, from none at all (as in previously
proposed solutions) to the maximum concurrency obtained by issuing all
probes in parallel. We then show how optimization of the probing process
can be used to select various degrees of concurrency for the desired
tradeoffs of concurrency against processing cost and message size.



Keywords: 

View Maintenance, Data Warehouse, Distributed Query Optimization. 

1 Introduction 

Data warehousing is increasingly used to collect information from diverse and 
possibly heterogeneous data sources and to provide a platform for quick response 
to queries. For efficient query processing, the views are commonly materialized
at the data warehouse, giving rise to the need to keep the views up-to-date
with respect to the data sources. This can be performed either by complete
recomputation of the views or by incremental maintenance techniques [1-3].

We consider the case of a warehouse with multiple data sources, which may be 
separate physical sites or independent relations within a site. The data warehouse 
stores materialized views which are expressed as SQL queries over data sources. 
Each data source locally responds to user queries and updates. Updates at the 
data sources are propagated as update messages to the warehouse and must be 
integrated into the materialized view in order to keep it up-to-date, but this 
poses several problems. 

* This work is partially supported by the NSF under grant numbers CCR-9712108 
and EIA-9818320 

^ In the USA at least, a posse is a group of citizens temporarily deputized into the 
police force, for a (more or less loosely coordinated) search. 
First, the update message contains information only about changes at the
data source where the update occurred, but the corresponding change to the 
view may depend on information at other data sources. Accordingly, the data 
warehouse must probe (i.e. query) the other data sources where the additional 
information resides, and must delay installation of the change to the materialized 
view until all the necessary information arrives. This is in effect a distributed 
query, which gives rise to several optimization issues regarding the sequence in 
which to send the probes, and the data to be requested in each probe. 

Second, because of the delays introduced in probing the data sources, it 
can happen that updates occur concurrently at the data sources being probed, 
and consequently the replies to the probes may reflect updates that occur in 
some sense “after” the update which is being processed. Without some sort 
of compensation, the effects of these interfering updates could result in the 
materialized view becoming inconsistent with respect to the data sources. 

Existing algorithms for incremental maintenance of materialized views in
a data warehouse [4-6] have focused primarily on this problem of compensation
for interfering updates. The ECA, STROBE and SWEEP algorithms limit
themselves to select-project-join (SPJ) view queries and relational (i.e.
duplicate-free) semantics, whereas warehouse views typically use aggregation
and group-by queries to reduce query processing times, and multiset (bag)
semantics. The PSWEEP algorithm [9] extends SWEEP by introducing parallel
evaluation of updates, but retains SWEEP's focus on linear joins and relational
semantics. We develop the Posse framework, which can express all these prior
algorithms and adds concurrency within each update, multiset semantics,
aggregates and group-by queries, and optimizations of the query plans.

The rest of this paper is organized as follows: Sect. 2 presents the definitions 
and assumptions for the rest of the paper. Section 3 introduces a motivational 
example that will be used in the rest of the paper. Section 4 describes the 
theoretical and conceptual environment for addressing issues of consistency and 
correctness. Section 5 presents the main result of this paper, the framework in 
which optimizations and consistency can coexist. Section 6 concludes the paper. 

2 Definitions and Assumptions 

The data sources report to the data warehouse all updates to their local rela- 
tions, and the data sources are capable of processing and responding to SQL 
queries relating to their contents. The data warehouse defines one or more ma- 
terialized views expressible in SQL, including GROUP BY and HAVING clauses, 
and aggregate functions. Queries from the data warehouse take the form of SQL 
queries. We refer to such a query as a probe. We refer to a probe and its reply 
as belonging to the update or to the update message that gave rise to the probe. 
Where it is not ambiguous, we use a name of a relation as found in the view 
query to label a data source and updates originating from that source. 

Communication is via reliable point-to-point FIFO channels. Data sources
are independent in that they do not synchronize or communicate with each
other, and updates to the data sources are mutually independent. Data sources
process updates and queries atomically; queries are processed in the order of
their arrival, and messages relating to these events are transmitted in the same
order as the processing of the events.

An update message takes the form of a delta table which is a relation aug- 
mented with a signed integer cardinality. The cardinality is not an attribute of 
the schema, but encodes both the number of duplicates of a tuple for multiset 
semantics and the insertion (positive) or deletion (negative) sense of an update 
message or probe reply. We use the terms multiset and relation interchangeably 
in this paper, and regard a multiset as having the form of a delta table with only 
positive cardinalities, although we omit the cardinality from multiset examples 
when the cardinalities are all +1. We extend the operations of relational algebra 
to apply to delta tables in the natural way, consistent with their interpretation 
as multisets. 
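A delta table can be sketched in Python as a mapping from tuples to signed cardinalities. This representation and the helper name `delta_add` are assumptions made here for illustration; the sample data is borrowed from the Enrollment relation of the example in Sect. 3.

```python
def delta_add(table, delta):
    """Apply a delta table to a multiset, both given as {tuple: cardinality}.

    Cardinalities are added per tuple; tuples whose cardinality becomes
    zero are dropped, so a deletion (-1) cancels an existing copy (+1).
    """
    result = dict(table)
    for row, cardinality in delta.items():
        total = result.get(row, 0) + cardinality
        if total:
            result[row] = total
        else:
            result.pop(row, None)
    return result


# A multiset is a delta table with only positive cardinalities.
enrollment = {
    ("Kevin", "CS201"): 1,
    ("Mary", "CS103"): 1,
    ("Mike", "CS101"): 1,
    ("Sharon", "CS102"): 1,
}
# An update message: one insertion (+1) and one deletion (-1).
update_e = {("Mary", "CS201"): +1, ("Mary", "CS103"): -1}
enrollment_after = delta_add(enrollment, update_e)
```

Keeping the sign in the cardinality means that insertions and deletions are handled by the same addition, which is what makes the two kinds of update commute.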

3 Motivation 

We use an example to demonstrate the operation of a data warehouse. This ex- 
ample comprises a data warehouse with a single view defined over the three data 
sources of Fig. 1, which are (part of) an entity-relationship schema describing 
university enrollment and teaching assistant appointments. The data warehouse 
materializes a single view over these relations: 

CREATE VIEW DoubleStatus(Prof, Conflicts)
AS SELECT T1.Prof, COUNT(E.Student)
   FROM Assists A, Enrollment E, Teaches T1, Teaches T2
   WHERE T1.Prof = T2.Prof AND T1.Course = A.Course
     AND T2.Course = E.Course AND E.Student = A.Student
   GROUP BY T1.Prof



Teaches

Course   Prof
CS101    Jones
CS102    Smith
CS103    Smith
CS201    Jones

Assists

Student  Course
Kevin    CS101
Mary     CS102

Enrollment

Student  Course
Kevin    CS201
Mary     CS103
Mike     CS101
Sharon   CS102

Fig. 1. The Example Base Relations



The ECA algorithm [4] cannot be used for our example view because ECA
contemplates only a single data source. The STROBE algorithms [5] cannot
handle the view because they require the view to contain a key for each base
relation. The SWEEP algorithms [6] cannot support the view because of their
limitation to select-project-join view queries.

The corresponding state of the view would be DoubleStatus in Fig. 2A,
reflecting the fact that Jones and Smith each have one student (Kevin and
Mary, respectively) who is simultaneously an assistant and a student. The
DoubleStatus view is an interesting example of a query that can be used to
expose multiple relationships which may potentially represent, e.g.,
conflict-of-interest relationships.
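As a quick check of the example, the view query can be evaluated over the relations of Fig. 1 with Python's built-in sqlite3 module. This is only a sketch: the in-memory database, the column types and the added ORDER BY clause are assumptions made for the illustration and are not part of the paper's setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teaches (Course TEXT, Prof TEXT);
    CREATE TABLE Assists (Student TEXT, Course TEXT);
    CREATE TABLE Enrollment (Student TEXT, Course TEXT);
    INSERT INTO Teaches VALUES
        ('CS101', 'Jones'), ('CS102', 'Smith'),
        ('CS103', 'Smith'), ('CS201', 'Jones');
    INSERT INTO Assists VALUES ('Kevin', 'CS101'), ('Mary', 'CS102');
    INSERT INTO Enrollment VALUES
        ('Kevin', 'CS201'), ('Mary', 'CS103'),
        ('Mike', 'CS101'), ('Sharon', 'CS102');
""")

# The DoubleStatus view query; ORDER BY is added only to make the
# output order deterministic.
VIEW_QUERY = """
    SELECT T1.Prof, COUNT(E.Student)
    FROM Assists A, Enrollment E, Teaches T1, Teaches T2
    WHERE T1.Prof = T2.Prof AND T1.Course = A.Course
      AND T2.Course = E.Course AND E.Student = A.Student
    GROUP BY T1.Prof
    ORDER BY T1.Prof
"""
before = conn.execute(VIEW_QUERY).fetchall()

# Replace (Mary, CS103) by (Mary, CS201) in Enrollment and re-evaluate.
conn.execute("DELETE FROM Enrollment WHERE Student = 'Mary' AND Course = 'CS103'")
conn.execute("INSERT INTO Enrollment VALUES ('Mary', 'CS201')")
after = conn.execute(VIEW_QUERY).fetchall()
```

On the data of Fig. 1 the query yields one conflict each for Jones and Smith. After the Enrollment change alone, only the Jones row remains; this second result is computed from the data here rather than copied from a figure.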




Fig. 2. The States of the View 



Now suppose that an update occurs at the Enrollment relation, replacing
the tuple (Mary, CS103) with the tuple (Mary, CS201), and that Enrollment sends
the update message updateE of Fig. 3. Suppose also that there is an update
to the Teaches relation replacing the tuple (CS201, Jones) with the tuple
(CS201, Smith), resulting in the update message updateT of Fig. 3. Let us also
suppose that these update messages arrive at the data warehouse in the order
given, before processing begins on either message.



updateE (from Enrollment)

Student  Course  Cardinality
Mary     CS201   +1
Mary     CS103   -1

updateT (from Teaches)

Course   Prof    Cardinality
CS201    Smith   +1
CS201    Jones   -1

reply(E,T) (from Teaches, for updateE)

Course   Prof    Student  Cardinality
CS103    Smith   Mary     -1
CS201    Smith   Mary     +1

Fig. 3. The Messages



The first update (updateE) should change the contents of the DoubleStatus
view so that Mary no longer has double status with regard to Smith. The
state of the view after this update should be DoubleStatus in Fig. 2B. However,
in order to arrive at this result, the update messages have to be joined with
information from other base relations. The view contains none of the attributes
that appear in updateE, so that it becomes necessary to obtain information
from, for example, the Teaches relation. But since updateT has already arrived
from that relation and is reflected in the contents of the Teaches relation, the
current state of Teaches would be incorrect for computing the effect of updateE
by itself to produce Fig. 2B. Some sort of compensation for the effects of the
concurrent updateT must be applied.

4 The Environment 

In this section we present the computational and conceptual environment for 
our framework for view maintenance. The correctness model is the conceptual 
standard for our definitions of consistency. The computational model comprises 
the machinery available to our algorithms. 



4.1 The Correctness Model 

Consider a system architecture in which, in addition to the materialized view, the 
data warehouse materializes copies or mirrors of all of the base relations. Updates 
to the base relations at the data sources are sent as update messages to the data 
warehouse. The processing of an update message at the data warehouse is a 
single atomic action which includes updating the materialized copy of the base 
relation, reevaluation of the view query, and installation of a new materialized 
view. We regard it as axiomatic that the operation of a system of this architecture 
is correct, and use it as the standard for comparison from the points of view of 
correctness and consistency. 

4.2 The Computational Model 

The framework we propose uses the computational model illustrated in Fig. 4. 
In this execution environment, the data warehouse does not contain mirrors of 
the base relations. As a result, it is necessary for the data warehouse to send 
probes to the data sources to obtain the information to complete processing of 
each update message. 

The incoming messages at the data warehouse, both update messages and 
replies to probes, are kept in a Message List, and each message remains in the 
list until it has been completely processed. In the case of a probe reply, this 
completion may be more or less immediate, but it is helpful for such replies 
to reside in the Message List in order to keep track of their sequence of arrival 
with respect to other messages, because the message sequence then identifies the 
updates that are conflicting. In the case of update messages, completion occurs 
after all needed probes have been sent and replies have been processed, and 
when subsequently the update message is at the head of the Message List. At 
that time the computation of the appropriate change to the materialized view 
is complete, and the change is installed. The imposed sequence of installation of 
changes is thus the same sequence in which the update messages arrived, and 
can easily be seen to be the same order in which the same changes would be 
installed in the correctness model. The proof of correctness is omitted here, but 
follows directly from the order of installation of the updates and the correctness 
of the message compensation, developed in [6]. 
Fig. 4. The Computational Model



All messages at the data warehouse are handled by the Strategy, according
to the warehouse and source schemas. In our implementation, the Strategy
comprises schemas and query plans contained in a JavaBean whose superclasses
provide the computational infrastructure which defines whether we are
executing STROBE [5], SWEEP [6], or an Optimized Posse algorithm as discussed
in Sect. 5.3.

For example, when implementing SWEEP or related algorithms, this
infrastructure closely follows the local compensation method of [6]. This
method recognizes that updates comprise insertions and deletions, and that
these are commutative operations. It therefore turns out that by retaining
update messages in the Message List, the data warehouse has preserved
sufficient information to reconstruct replies to queries “as if” the queries
had been issued and answered atomically at the same point in the message stream
where the update arrived. Accordingly the data warehouse reconstructs the
replies to queries that would have been received in the Correctness Model of
Sect. 4.1. Since the probe replies must be capable of compensation for
interfering updates, we refer to them as locally compensatable. This is related
to the property that qualifies a view as being self-maintainable, but the
domain of applicability is different. The full version of this paper [7]
discusses some of the complexities of mapping the space of such queries, as
well as the details of how aggregate functions and HAVING clauses may be
implemented in our framework.
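The idea of compensating a probe reply with a retained update message can be illustrated on the messages of Fig. 3. The sketch below is written in the spirit of the local compensation described above and is not the implementation of [6] or of this paper; the dictionary representation of delta tables and the helper names `probe_teaches` and `subtract` are assumptions made here.

```python
def probe_teaches(update_e, teaches):
    """Join an Enrollment delta with a Teaches delta table on Course.

    Cardinalities multiply, which is the multiset reading of a join.
    Result rows are (Course, Prof, Student).
    """
    out = {}
    for (student, course), c1 in update_e.items():
        for (t_course, prof), c2 in teaches.items():
            if course == t_course:
                row = (course, prof, student)
                out[row] = out.get(row, 0) + c1 * c2
    return {row: card for row, card in out.items() if card != 0}


def subtract(a, b):
    """Subtract delta table b from delta table a, dropping zero rows."""
    out = dict(a)
    for row, card in b.items():
        out[row] = out.get(row, 0) - card
    return {row: card for row, card in out.items() if card != 0}


update_e = {("Mary", "CS201"): +1, ("Mary", "CS103"): -1}
update_t = {("CS201", "Smith"): +1, ("CS201", "Jones"): -1}
teaches_before = {("CS101", "Jones"): 1, ("CS102", "Smith"): 1,
                  ("CS103", "Smith"): 1, ("CS201", "Jones"): 1}
teaches_now = {("CS101", "Jones"): 1, ("CS102", "Smith"): 1,
               ("CS103", "Smith"): 1, ("CS201", "Smith"): 1}

# Reply actually received: Teaches already reflects updateT.
reply = probe_teaches(update_e, teaches_now)
# Contribution of the interfering update, computed from the retained message.
interference = probe_teaches(update_e, update_t)
# Compensated reply: what Teaches would have answered before updateT.
compensated = subtract(reply, interference)
```

The uncompensated reply coincides with reply(E,T) of Fig. 3, and the compensated reply equals the answer that the earlier state of Teaches would have produced, which is the reply expected in the Correctness Model.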

The computational model as presented here can achieve both complete
consistency and the maximum possible concurrency. We call this approach the
Concurrent Posse. The concurrency is achieved by installing a Strategy that, in
service of each update, simply queries all data sources for their entire
content, applying the view query to the compensated replies to compute a new
materialized view. These queries can be issued concurrently for all updates and
all data sources, as each update message arrives. The compensation algorithm
assures consistency, while the restriction of installing views in the order of
the arrival of their corresponding updates assures the completeness of the
consistency.

The Concurrent Posse algorithm operates in the computational model to 
define an extreme of concurrency at the cost of large message sizes, so that a 
spectrum of solutions can be realized between Concurrent Posse on the one hand 
and the existing algorithms [4-6] on the other. That such an extreme approach is 
feasible demonstrates the flexibility of the framework proposed here, and suggests 
the possibilities that lie between extremes. 

5 The Framework 

The Concurrent Posse algorithm may be impractical because of the message sizes 
that would result from asking for the transmission of all base relations in response 
to each update. On the other hand, if the data warehouse is going to send all 
queries immediately on receiving the update message, there may be insufficient 
information at hand to refine the query at that moment. This presents us with 
an opportunity to trade off two contributors to overall efficiency: concurrency 
and message size. 

The Posse framework and the Optimized Posse algorithm that we present in
this section, operating in the computational model we have already presented,
provide a setting in which such tradeoffs can be conveniently exploited.
We impose a conceptual structure on the schema and update processing, using
rooted directed graphs to represent the organization of the algorithms we use.

5.1 Schema and Rooted DAG 

At the data warehouse the Strategy associates a query plan with each data 
source, to be used for each update message received from that data source. Each 
query plan represents a schedule for issuing probes in response to an update 
message from a particular data source. We present each such query plan as a 
rooted directed acyclic graph (rooted DAG), with one vertex for each occurrence 
of a relation in a FROM clause in the view query. 

Each non-root vertex is associated with a query to be used to probe the 
corresponding data source for the data relevant to the use of that instance of the 
relation in the processing of the query. Each vertex is labeled; the same name is 
used for any update message, probe or data associated with that vertex. Edges 
represent a dependency that can be expressed as “is completed before composing 
the probe of,” and indicate that the probe for the destination vertex is delayed 
until the data from the source of that edge is available. 

In general there is also a final query over the original update message and
the probe responses, which produces the set of changes to be installed in the
materialized view. We assume that this query is executed at the data warehouse
after all probe replies have been received and corrected for interfering updates.
For Concurrent Posse the DAG is star-shaped: it contains an edge from the root
to each leaf; this is our usual starting point for developing the final DAG.

We concern ourselves in this section with optimizations that correspond to 
adding or deleting one or more edges to the star-shaped graph for a query plan 
and choosing an appropriate query to associate with the destination vertex. 
Edges are deleted when the probe at the destination node has been combined 
with a probe at another node for the same data source. Added edges reduce the 
concurrency in the query plan by delaying the probe at the destination vertex 
until completion of the probe at the source vertex. The advantage in doing this 
derives from the possibility of greatly reducing message size, local processing cost 
or both. Properly chosen, these optimizations can improve query performance 
at the data source and reduce the size of the response messages, at the expense of 
some loss of concurrency in processing. 
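
As an illustration only, the effect of an added dependency edge on concurrency can be sketched in Python. The function name `probe_waves`, the vertex labels `root`, `p1`, `p2`, `p3` and the edge-list representation are assumptions made for this sketch; they are not part of the Posse implementation. The sketch groups the probes of a rooted DAG into successive waves, where all probes in one wave may be issued concurrently.

```python
from collections import defaultdict


def probe_waves(edges, root):
    """Group the vertices of a rooted DAG into waves of concurrent probes.

    Each (source, destination) pair in `edges` means that the probe for
    `destination` is composed only after the data from `source` is available.
    """
    preds = defaultdict(set)
    vertices = {root}
    for src, dst in edges:
        preds[dst].add(src)
        vertices.update((src, dst))

    done = {root}  # the update message is already at the warehouse
    waves = []
    while len(done) < len(vertices):
        # A probe is ready once every vertex it depends on has completed.
        ready = sorted(v for v in vertices - done if preds[v] <= done)
        if not ready:
            raise ValueError("the plan is not a rooted DAG")
        waves.append(ready)
        done.update(ready)
    return waves


# Star-shaped plan (hypothetical labels): every probe depends only on the root.
star = [("root", "p1"), ("root", "p2"), ("root", "p3")]
# The same plan with one added edge: p3 now waits for the reply to p1.
delayed = star + [("p1", "p3")]

star_waves = probe_waves(star, "root")
delayed_waves = probe_waves(delayed, "root")
```

In the star-shaped plan all three probes fall into a single wave; the added edge moves `p3` into a second wave, which is the loss of concurrency that is traded for smaller messages or cheaper local processing.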



5.2 Optimized Update Processing 

In planning the maintenance of the DoubleStatus view one can apply various 
transformations to the query plan, as might be done by a query optimizer. We 
start with a “star-shaped” graph with edges from the root directly to each of 
the leaves and consider transformations that can be represented using the graph 
and associated probes. 

These transformations are used in the graphs of Fig. 5, which corresponds to 
the example introduced in Sect. 3. First, each edge in the graph can be associated 
with a join involving information from the source node of the edge, and any 
previous nodes. Compared to simply requesting the entire content of the relation, 
this probe can (vastly) reduce message size. 




Fig. 5. Directed graphs for optimized query plans for the DoubleStatus view. Graph A 
is used for updates from the Teaches relation, Graph B for updates from Assists, and 
Graph C for updates from Enrollment. 



Second, a probe can be delayed as seen in Fig. 5A at node Ea3; this probe 
of the Enrollment relation is delayed until the reply to the previous probe is 
available at the warehouse, providing information that can be used in the join. 

Third, in a few cases multiple probes of a single relation can be combined, as 
seen in Fig. 5C at nodes T1c3 and T2c3. Node T1c3 has actually been bypassed 
because appropriate manipulation of the reply to the probe T2c3 can reconstruct 
the reply that would be received for T1c3. This optimization applies because the 
join between the two nodes is an equijoin with corresponding attributes. 

Fourth, there is no query plan rooted at T2 because we have applied another 
optimization. In the general case, without this optimization, we would “echo” 
updates from Teaches so that there would appear to be two data sources, T1 
and T2, whose updates “just happen” always to be identical and adjacent in the 
Message List. 

The transformations chosen in these examples may or may not be the best in a 
given case. The combination of T1c3 and T2c3 may trade message complexity for 
computational complexity. Moreover, it may be better to use one of the variants 
of semijoin [8] for the other joins, and it may be better to choose different or 
additional dependency edges to add to the graph. Our purpose here is not the 
optimal choice of transformations but the description of a framework in which 
transformations can be applied to the algorithm for view maintenance. In any 
event, the goal of optimization is often achieved by the avoidance of very bad 
solutions, and it appears that this goal is achieved in these examples, and that 
a high degree of concurrency is maintained. 



5.3 The Optimized Posse Algorithm 

The Posse framework allows query plans that schedule concurrent probes to 
perform incremental view maintenance. We presume that the query plans are 
the output of an optimizer process which runs when the view query is defined. 
The Posse framework and its DAGs formalize the notion that the probes are 
initiated in a particular order rather than all at once as in Concurrent Posse. 
The opportunity for optimization lies in the fact that having information from 
prior probes and using it in a join or semi-join operation may reduce the size of 
subsequent replies. 

This algorithm operates by creating a thread of control for each update mes- 
sage, and creating and deleting additional threads as the DAG branches and 
joins, following the structure of the DAG associated with the source of the up- 
date message for which the delta to the view is being computed. As subpaths 
are completed, their partial results are assembled until the correct delta to the 
view has been computed. These deltas may be installed in the materialized view 
in order according to the delivery order of their update messages to achieve 
complete consistency [4] with the Correctness Model. Batching of updates can 
be used, in which case strong consistency is maintained. Variant commit orders 
may also be used where convergence is acceptable behavior. 

6 Conclusion 

We have presented a correctness model for evaluating the consistency and cor- 
rectness of algorithms for the transport of base relation information for the 
incremental maintenance of a materialized view in a data warehouse. We have 
shown that a completely concurrent algorithm (Concurrent Posse) obtains 
complete concurrency of the probing process, albeit at the cost of increased message 
sizes. We have presented the Posse framework, in which distributed query 
optimizations can be applied to the Concurrent Posse approach in order to obtain 
reasonable performance of the incremental maintenance process, and have 
indicated areas in which further improvements might be obtained. Throughout, we 
have maintained the criterion of complete consistency with a correctness model 
that honors the order of arrival of update messages from the data sources. Our 
present approach uses fixed query plans determined in advance. In the future we 
plan to investigate dynamically determining the query plan and to explore alter- 
nate optimization criteria for choosing the query plans, and to fully characterize 
locally compensatable probes. 


Abstract. Data Warehouse applications use a large number of materialized 
views to help a Data Warehouse perform well. However, selecting the 
views to be materialized is challenging. Several heuristic algorithms have 
been proposed in the past to tackle this problem. In this paper, we 
propose a completely different approach, a Genetic Algorithm, to choose 
materialized views, and we demonstrate that it is practical and effective 
compared with heuristic approaches. 

1 Introduction 

Data Warehousing is an in-advance approach to the integration of data from 
multiple, possibly very large, distributed, heterogeneous databases and other 
information sources. A Data Warehouse (DW) can be viewed as a repository of 
materialized views of integrated information available for querying and analysis. 

The problem of selecting materialized views to efficiently support a data 
warehouse application has been proven to be NP-complete [4]. Most of the work 
done in this area [9], [1], [6], [7], [3], [10] uses heuristics to select materialized 
views in order to obtain a near-optimal solution, that is, one close to the minimum 
sum of query response time and view maintenance time under some constraint. 
The “Greedy Algorithm” has usually been used to select the materialized views [5], [3], [4]. 

The algorithms used in optimization can be classified into four types [8], [2]: 
deterministic algorithms, randomized algorithms, genetic algorithms (GAs), and 
hybrid algorithms. GAs differ from many conventional optimization and search 
procedures in three ways: they work with a coding of the parameter set, they search 
for a solution from a population of points, and they use probabilistic transition rules to 
search the problem space. There are two issues we should consider. One is the 
representation transformation. GAs work on bit strings. However, the materialized 
view selection problem of interest here is usually represented as a directed 
acyclic graph (DAG), to which the GA cannot be directly applied. The other issue 
is that a small change to the string in the GA can produce a large change to 
the DAG in our problem, and such changes may sometimes produce an invalid result. 

The rest of the paper is organized as follows. Section 2 gives an example to 
illustrate briefly the problem of selecting materialized views. In Section 
3, we present the GA for the selection of materialized views. Section 4 presents 
the experimental results obtained with our evolutionary approach, and Section 5 
concludes by analyzing and summarizing our experimental results. 

2 Materialized View Selection Problem 

2.1 Motivating Example 

In this section, we first present an example to motivate the discussion of selec- 
tion of materialized views in DW. The discussion is presented in terms of the 
relational data model with select, join, and aggregate operations. Our examples 
are taken from a DW application which analyzes trends in sales and supply, and 
which were used in [10]. The relations and the attributes of the schema for this 
application are: 

Item(I_id, I_name, I_price) 

Part(P_id, P_name, I_id) 

Supplier(S_id, S_name, P_id, City, Cost, Preference) 

Sales(I_id, Month, Year, Amount) 

There are five queries, as follows: 

Q1: Select P_id, min(Cost), max(Cost) 
    From Part, Supplier 
    Where Part.P_id=Supplier.P_id 
    And P_name in {"spark_plug", "gas_kit"} 
    Group by P_id 

Q2: Select I_id, sum(amount*number*min_cost) 
    From Item, Sales, Part 
    Where I_name in {"MAZDA", "NISSAN", "TOYOTA"} 
    And year=1996 
    And Item.I_id=Sales.I_id 
    And Item.I_id=Part.I_id 
    And Part.P_id= 
        (Select P_id, min(Cost) as min_cost 
         From Supplier 
         Group by P_id) 
    Group by I_id 

Q3: Select P_id, month, sum(amount) 
    From Item, Sales, Part 
    Where I_name in {"MAZDA", "NISSAN", "TOYOTA"} 
    And year=1996 
    And Item.I_id=Sales.I_id 
    And Part.I_id=Item.I_id 
    Group by P_id, month 

Q4: Select I_id, sum(amount*I_price) 
    From Item, Sales 
    Where I_name in {"MAZDA", "NISSAN", "TOYOTA"} 
    And year=1996 
    And Item.I_id=Sales.I_id 
    Group by I_id 

Q5: Select I_id, avg(amount*I_price) 
    From Item, Sales 
    Where I_name in {"MAZDA", "NISSAN", "TOYOTA"} 
    And year=1996 
    And Item.I_id=Sales.I_id 
    Group by I_id 



Figure 1 represents a possible global query access plan for the five queries. 
The local access plans for the individual queries are merged based on the shared 
operations on common data sets. This is called a Multiple View Processing 
Plan (MVPP) in [10]. 







Fig. 1. A motivating example 



Now the problem we are dealing with is how to select the views to be mate- 
rialized so that the cost of query processing and view maintenance for the whole 
set of nodes in the MVPP is minimal. 

An obvious approach is to apply an exhaustive algorithm for materialized 
view selection to the set of queries. However, this approach is very expensive 
if the search space is large. Many researchers have therefore applied heuristics 
to trim the search space in order to get results quickly, but then only a 
near-optimal solution can be achieved. In order to avoid exhaustively searching 
the whole solution space, and to obtain a better solution than that obtained by 
heuristics, we use a GA to deal with this problem. 

In the next section, we will present a brief specification of the cost model, 
then we will explain what the GA is and how to apply it to our problem. 

2.2 Specification of Cost Model 

An access plan is a labeled DAG $(V, A, C_a^q(v), C_m^r(v), f_q, f_u)$ where $V$ is a set 
of vertices and $A$ is a set of arcs over $V$; $C_a^q(v)$ and $C_m^r(v)$ are costs, and $f_q$, $f_u$ are 
frequencies. The cost model is constructed as follows: 

1. For every relational algebra operation in a query tree, for every base 
relation, and for every distinct query, create a vertex; 

2. For $v \in V$, $T(v)$ is the relation generated by the corresponding vertex $v$. $T(v)$ 
can be a base relation, an intermediate result produced while processing a query, or the 
final result of a query; 

3. For any leaf vertex $v$ (that is, one which has no edges coming into the 
vertex), $T(v)$ corresponds to a base relation. Let $L$ be the set of leaf nodes. 

4. For any root vertex $v$ (that is, one which has no edges going out of the 
vertex), $T(v)$ corresponds to a global query. Let $R$ be the set of root nodes. 

5. If the base relation or intermediate result relation $T(u)$ corresponding to 
vertex $u$ is needed for further processing at a node $v$, introduce an arc 
$u \to v$; 

6. For every vertex $v$, let $S(v)$ denote the source nodes which have edges 
pointing to $v$. For any $v \in L$, $S(v) = \emptyset$. Let $S^*(v)$ be the set of descendants 
of $v$; 

7. For every vertex $v$, let $D(v)$ denote the destination nodes to which $v$ 
points. For any $v \in R$, $D(v) = \emptyset$; 

8. For $v \in V$, $C_a^q(v)$ is the cost of query $q$ accessing $T(v)$; $C_m^r(v)$ is the cost 
of maintaining $T(v)$ based on changes to the base relation $S^*(v) \cap R$, if 
$T(v)$ is materialized. 

9. $f_q$ and $f_u$ denote the query frequency and the base relation maintenance frequency, 
respectively. 



3 Genetic Algorithm for Materialized View Selection 

Since GA simulates a biological process, most of its terminology is borrowed 
from biology. A detailed illustration of GA terminology can be found in [2]. One 
of the differences between GA and other commonly used techniques is that GA 
operates on a population of strings, not a single string. Every population is called a 
generation. A single solution is called a phenotype and is represented by a single 
string. Solutions are represented as strings (chromosomes) that are composed 
of characters (genes) that can take one of several different values (alleles). GA 
creates an initial generation, G(0), and for each generation, G(t), generates a 
new one, G(t+1). An abstract view of the algorithm is shown in Figure 2. 

begin 
    Generate initial population, G(0); 
    Evaluate G(0); 
    t:=0; 
    repeat 
        t:=t+1; 
        generate G(t) using G(t-1); 
        alter G(t); 
        evaluate G(t); 
    until solution is found; 
end; 

Fig. 2. An abstract view of Genetic Algorithm 

Each problem should have its own solutions represented as character strings 
by an appropriate encoding. Selection, crossover and mutation are three operators 
applied to successive string populations to create new populations. In other 
words, these three operators are applied to G(t-1) to generate G(t), as shown in 
Figure 2. Choosing a fitness function is important, since fitness is used to evaluate 
the individuals in G(t). In a GA, the average fitness and the fitness of the best solution 
tend to increase with every new generation. In order to get the best solution, many 
generations may need to be evolved. Several stopping criteria exist for the algorithm. 
For example, the algorithm may be halted when all solutions in a generation are 
identical. 
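
The loop of Figure 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the program used in our experiments: the names `evolve`, `toy_select`, `toy_crossover` and `toy_mutate`, the toy fitness (the number of ones in the string), the population size and the mutation rate are all chosen for illustration only.

```python
import random


def evolve(population, fitness, select, crossover, mutate, max_generations, rng):
    """Generate G(t) from G(t-1), alter it, evaluate it, and remember the best string."""
    best = max(population, key=fitness)
    for _ in range(max_generations):
        if len(set(population)) == 1:  # stop when all solutions are identical
            break
        parents = select(population, fitness, rng)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            children.extend(crossover(a, b, rng))
        population = [mutate(child, rng) for child in children]
        best = max(population + [best], key=fitness)
    return best


def toy_select(population, fitness, rng):
    # Fitness-proportional sampling; the +1 keeps every weight positive.
    weights = [fitness(s) + 1 for s in population]
    return rng.choices(population, weights=weights, k=len(population))


def toy_crossover(a, b, rng):
    k = rng.randint(1, len(a) - 1)  # crossover position
    return a[:k] + b[k:], b[:k] + a[k:]


def toy_mutate(s, rng, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return "".join(("1" if c == "0" else "0") if rng.random() < rate else c for c in s)


rng = random.Random(0)
initial = ["".join(rng.choice("01") for _ in range(8)) for _ in range(10)]
best = evolve(initial, lambda s: s.count("1"), toy_select, toy_crossover,
              toy_mutate, max_generations=30, rng=rng)
```

Because `evolve` keeps the best string seen so far, the returned solution is never worse than the best member of the initial population. The sketch assumes an even population size so that parents can be paired.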

In this paper, we devise our GA based on the principle of Simple GA de- 
scribed in [2]. With some modification on the policy of selection and fitness, we 
propose the following version of GA which is suitable for our problem. 

3.1 The String Representation of Our Solution 

Based on the principle of minimal alphabets of GA coding, the string is essen- 
tially a binary string of ones and zeroes. 

The representation of our problem is a MVPP, which is a DAG rather than 
a binary string. If we can map the representation from DAG to a binary string, 
we can apply GA to our problem. 

The mapping strategy is shown in figure 3. 

For example, searching through the DAG in Figure 1 width-first, we obtain 
the following mapping array: {[Q5,0], [Q4,0], [Q3,0], [Q2,0], [Q1,0], [result5,0], 
[result1,0], [result2,0], [result4,0], [result3,0], [tmp9,0], [tmp3,0], [tmp4,0], [tmp8,0], 
[tmp7,0], [tmp10,0], [tmp1,0], [tmp2,0], [tmp5,0], [tmp6,0]}. Its length is 20, 
excluding the 4 source tables {Item, Sales, Part, Supplier}. 

Suppose the result of the GA is {0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1}. This means 
that the nodes {Q4, Q1, result5, tmp2, tmp5, tmp6} should be materialized. 

begin 
    1. Input an MVPP represented by a DAG; 
    2. Use a graph search strategy such as breadth-first, width-first or a 
       problem-oriented searching method to search through all of the nodes in the 
       DAG and produce an ordered sequence of these nodes; 
    3. Based on this sequence of nodes, create a two-dimensional array to store 
       the sequence of nodes and strings of 0s and 1s. One dimension is for the 
       sequence of nodes, the other dimension is for strings of 0s and 1s. In the 
       strings of 0s and 1s, 0 denotes that the corresponding node in the array, 
       indexed by the same subscript, is unmaterialized; 1 denotes that 
       the corresponding node in the mapping array is materialized. 
       This array is called the mapping array. 
end; 



Fig. 3. A mapping strategy 
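
The use of the mapping array can be sketched in Python. The node order below is copied from the example in the text; the names `MAPPING`, `decode` and `encode` are assumptions of this sketch. Since the DAG of Figure 1 is not reproduced here, the ordering step (step 2 of Figure 3) is taken as given rather than computed.

```python
# Node order of the mapping array given in the text (20 nodes, source tables excluded).
MAPPING = [
    "Q5", "Q4", "Q3", "Q2", "Q1",
    "result5", "result1", "result2", "result4", "result3",
    "tmp9", "tmp3", "tmp4", "tmp8", "tmp7",
    "tmp10", "tmp1", "tmp2", "tmp5", "tmp6",
]


def decode(bits, mapping=MAPPING):
    """Return the nodes to materialize for a bit string (1 = materialized)."""
    if len(bits) != len(mapping):
        raise ValueError("bit string and mapping array differ in length")
    return [node for node, bit in zip(mapping, bits) if bit == "1"]


def encode(nodes, mapping=MAPPING):
    """Return the bit string that materializes exactly the given nodes."""
    chosen = set(nodes)
    return "".join("1" if node in chosen else "0" for node in mapping)


ga_result = "010011" + "0" * 11 + "111"  # the GA result used in the text
selected = decode(ga_result)
```

Decoding the GA result from the example yields the nodes Q4, Q1, result5, tmp2, tmp5 and tmp6, in agreement with the text.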



3.2 Mapping Cost Function in Our Problem to Fitness Function in 
Genetic Algorithm 

The objective in our cost model is stated as the minimization of the sum of 
query cost and maintenance cost, while the objective or fitness function of GA is 
naturally stated as maximization. Therefore, there should be a transformation 
from our cost function to the fitness function in GA. 

The commonly used transformation in GA is as follows: 

$$ f(x) = \begin{cases} C_{max} - c(x) & \text{if } c(x) < C_{max} \\ 0 & \text{otherwise} \end{cases} $$ 

where $c(x)$ denotes the cost function. There are many ways to choose the coefficient 
$C_{max}$: it may be taken as an input coefficient, as the largest $c(x)$ value 
observed so far, as the largest $c(x)$ value in the current population, or as the 
largest of the last $k$ generations. 
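
A minimal sketch of this transformation in Python is given below. The function name `fitness_values` and the sample costs are assumptions for illustration; by default the sketch takes the coefficient as the largest cost in the current population, which is one of the choices listed above.

```python
def fitness_values(costs, c_max=None):
    """Map costs (to be minimized) to fitness values (to be maximized)."""
    if c_max is None:
        c_max = max(costs)  # largest cost in the current population
    return [c_max - c if c < c_max else 0 for c in costs]


costs = [10, 40, 25]                          # hypothetical total costs of three strings
by_population = fitness_values(costs)         # coefficient taken from the population
by_input = fitness_values(costs, c_max=30)    # coefficient supplied as an input
```

With the population maximum as coefficient, the most expensive string always receives fitness 0; with an input coefficient, every string whose cost reaches or exceeds the coefficient receives fitness 0.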

3.3 Crossover 

The crossover operator combines random number generation, string copying and 
the swapping of partially good solutions in order to obtain a better result. 

For example, consider two strings from our example: 

L1 = 1100100|0100100001111    L2 = 0100110|1011000100111 

L1 means that nodes {Q5, Q4, Q1, result4, tmp3, tmp1, tmp2, tmp5, tmp6} 
are materialized. L2 means that nodes {Q4, Q1, result5, result2, result3, tmp9, 
tmp7, tmp2, tmp5, tmp6} are materialized. Suppose k is chosen randomly from 1 to 20 
and we obtain k=7 (the symbol | marks the position at which crossover 
is applied). The results of the crossover are two new strings: 

L1' = 1100100|1011000100111    L2' = 0100110|0100100001111 

For the two new individuals, L1' means that nodes {Q5, Q4, Q1, result2, result3, 
tmp9, tmp7, tmp2, tmp5, tmp6} are materialized and L2' means that nodes {Q4, 
Q1, result5, result4, tmp3, tmp1, tmp2, tmp5, tmp6} are materialized. 

3.4 Mutation 

The mutation operator is a means of occasional random alteration of the 
value of a string position. It introduces new features that may not be present 
in any member of the population. The mutation is performed on a bit-by-bit 
basis. For example, assume that the 16th gene of the individual L1 = 
11001000100100001111 is selected for mutation. Since the 16th bit in this string 
is 0, it would be flipped to 1 with a probability equal to the mutation rate. So 
the individual L1 after this mutation would be 
L1' = 11001000100100011111 
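
The crossover example of Section 3.3 and the mutation example above can be reproduced with the following Python sketch. The function names `crossover` and `flip_bit` are assumptions of this sketch. For reproducibility, the crossover position and the mutated position are passed in explicitly instead of being drawn at random, and the flip is applied unconditionally rather than with a probability equal to the mutation rate.

```python
def crossover(a, b, k):
    """Single-point crossover: swap the tails of a and b after position k."""
    return a[:k] + b[k:], b[:k] + a[k:]


def flip_bit(s, position):
    """Flip the bit at the given 1-based position of bit string s."""
    i = position - 1
    flipped = "1" if s[i] == "0" else "0"
    return s[:i] + flipped + s[i + 1:]


L1 = "1100100" + "0100100001111"
L2 = "0100110" + "1011000100111"

child1, child2 = crossover(L1, L2, 7)   # k = 7 as in the crossover example
mutated = flip_bit(L1, 16)              # the 16th gene as in the mutation example
```

Here `child1` and `child2` correspond to the two new strings of the crossover example, and `mutated` corresponds to the individual obtained in the mutation example.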

3.5 Selection 

The mechanics of a simple GA involve nothing more complex than copying 
strings and swapping substrings. The selection operator is a process in which 
strings are copied according to their fitness values. A string with a higher 
fitness value has a higher chance of surviving. Selection is used to retain the 
good solutions in the population. 
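
One common way to copy strings according to their fitness is roulette-wheel (fitness-proportional) selection, as used in the Simple GA of [2]. The Python sketch below is an illustration under that assumption; the function name `roulette_select` and the sample population are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(population, fitnesses, rng):
    """Draw len(population) strings, each with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        # No string is better than another: fall back to uniform sampling.
        return [rng.choice(population) for _ in population]
    cumulative = list(accumulate(fitnesses))
    chosen = []
    for _ in population:
        spin = rng.random() * total              # a point on the wheel
        index = bisect_right(cumulative, spin)   # the slot that contains it
        chosen.append(population[min(index, len(population) - 1)])
    return chosen


rng = random.Random(1)
population = ["a", "b", "c"] * 100       # hypothetical strings
fitnesses = [30, 0, 15] * 100            # "b" has fitness 0 and cannot be selected
next_generation = roulette_select(population, fitnesses, rng)
```

A string with fitness 0 occupies no space on the wheel and is therefore never copied, while a string with twice the fitness of another is expected to be copied twice as often.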

3.6 Modifying the Selection Policy 

As mentioned before, GA can be applied to search the solution space. However, 
because of the random nature of crossover and mutation, some strings in the 
next generation might be “invalid”, i.e., they select some 
nodes that should not be materialized because some related nodes have already been 
materialized. 

In the following, we identify one rule that prevents some nodes from being selected 
for materialization. 

Rule 1: Suppose v1 is a parent of v2 and v2 has the same ancestors (excluding v1) as 
v1. We can prove that once v1 has been materialized, there is no need to materialize 
v2. 

Proof: To illustrate this, see figure ??. Let $C_{Q_i}(M)$ be the cost to 
compute $Q_i$ from the set of materialized views $M$. 

If v1 and v2 are materialized, then the total cost is: 

$$ C_1 = \sum_{q \in O_{v_1}} f_q(q) * C_a^q(v_1) + \sum_{q \in O_{v_2}} f_q(q) * C_a^q(v_2) + \sum_{r \in I_{v_1}} f_u(r) * C_m^r(v_1) + \sum_{r \in I_{v_2}} f_u(r) * C_m^r(v_2) \qquad (1) $$ 

As v1 and v2 have the same parents, (1) can be changed to: 

$$ C_1 = \sum_{q \in O_{v_1}} f_q(q) * (C_a^q(v_1) + C_a^q(v_2)) + \sum_{r \in I_{v_1}} f_u(r) * C_m^r(v_1) + \sum_{r \in I_{v_2}} f_u(r) * C_m^r(v_2) $$ 

Since v1 is materialized before v2, v2 cannot be reached by any queries, so 
$C_a^q(v_2) = 0$, and (1) becomes: 

$$ C_1 = \sum_{q \in O_{v_1}} f_q(q) * C_a^q(v_1) + \sum_{r \in I_{v_1}} f_u(r) * C_m^r(v_1) + \sum_{r \in I_{v_2}} f_u(r) * C_m^r(v_2) \qquad (2) $$ 

On the other hand, if we only materialize v1, the total cost is: 

$$ C_2 = \sum_{q \in O_{v_1}} f_q(q) * C_a^q(v_1) + \sum_{r \in I_{v_1}} f_u(r) * C_m^r(v_1) $$ 

Suppose that materializing both v1 and v2 were more beneficial than 
materializing v1 alone, in other words, $C_2 > C_1$. 

From $C_2 > C_1$, we obtain $\sum_{r \in I_{v_2}} f_u(r) * C_m^r(v_2) < 0$. This is impossible. 

Therefore, in the presence of the materialized view v1, we should not materialize v2 
under the condition mentioned above. 

For example, with respect to the MVPP in Figure 1, suppose that after crossover and 
mutation we get the string {00000000001100101100}. This string means that {tmp9, tmp3, 
tmp7, tmp1, tmp2} should be materialized. However, since {tmp3, tmp9} are 
the parents of {tmp1, tmp2} and they have the same parents, this string is 
“invalid”. This means that the cost for the string {00000000001100101100} is 
greater than that for the string {00000000001100100000}, in which {tmp1, tmp2} 
are unmaterialized. 

How can this problem be solved? There are several approaches. One solution is to 
add a constraint to the crossover and mutation operators. This approach gets rid of 
the “invalid” strings completely, but it results in fast convergence to a local 
minimum. 

An alternative approach is to relax the definition of validity and include 
a penalty in the cost function to ensure that “invalid” solutions are expensive. 
However, the design of the penalty function depends to some extent on empirically 
chosen values. 

We propose the algorithm shown in figure 4 which is based on the combination 
of the selection principle of GA and rule 1 to do postprocessing. 



begin 
    1. Obtain the initial solutions by GA; 
    2. By rule 1, repeat for each solution within these initial solutions: 
       if there are parent and child nodes such that 
           1) they have the same parents, and 
           2) they are materialized, 
       then change the child nodes to be unmaterialized and re-calculate the total cost; 
       if this new total cost is less than the initial one, 
       then replace the initial solution with this new solution; 
    until every initial solution is checked. 
end; 



Fig. 4. A revised algorithm 
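
The post-processing step of Figure 4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than our implementation: the DAG is represented by a hypothetical dictionary `parents` that maps each node to the set of its parents, the cost function `toy_cost` is invented for the example, and the names `ancestors` and `repair` are not taken from any existing program.

```python
def ancestors(node, parents):
    """Return all nodes reachable from `node` by following parent links."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def repair(solution, parents, total_cost):
    """Apply rule 1: drop a materialized child when that lowers the total cost."""
    best = set(solution)
    for child in sorted(solution):
        for parent in sorted(parents.get(child, ())):
            both_materialized = parent in best and child in best
            same_ancestors = (ancestors(child, parents) - {parent}
                              == ancestors(parent, parents))
            if both_materialized and same_ancestors:
                candidate = best - {child}
                if total_cost(candidate) < total_cost(best):
                    best = candidate
    return best


# Hypothetical three-node chain: v2 is used only by v1, and v1 only by query Q.
parents = {"v2": {"v1"}, "v1": {"Q"}}


def toy_cost(solution):
    # Invented costs: cheap query access if v1 is materialized, plus a
    # maintenance cost of 5 for every materialized node.
    access = 10 if "v1" in solution else 100
    return access + 5 * len(solution)


repaired = repair({"v1", "v2"}, parents, toy_cost)
```

In this toy example the solution that materializes both v1 and v2 costs 20, whereas materializing v1 alone costs 15, so the child v2 is changed to unmaterialized.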



4 Experiment 

Our experiment is built on the basis of the Simple Genetic Algorithm. Starting 
from the Simple Genetic Algorithm program [?], which is a C-language translation 
and extension of the original Pascal SGA code presented in [2], we developed our 
implementation on SUN-OS V5. General experience shows that the probability 
of mutation should be much less than that of crossover. In our experiment, the 
probability of crossover is 0.9 and that of mutation ranges from 0.005 to 0.1. The maximum 
number of generations should be at least double the population size. Figure 5 
shows the comparison between the heuristic algorithm and the Genetic Algorithm. 
The costs are normalized using the heuristic algorithm as the reference. The heuristic 
algorithm used in [10] is analogous to the Greedy Algorithm. 
To produce the results shown in Figure 5, we randomly produced 50 queries. The 
number of source relations involved in each query varies from 3 to 8. The number of 
nodes in the DAG varies from 24 to 200. 

Fig. 5. Experimental results 



From the results shown in Figure 5, we can conclude that for a small 
number of queries, the GA works significantly better than the heuristic approach. 
As the number of queries increases, the result of the GA approaches that of the 
heuristic. However, since the total cost also increases greatly, even a small relative 
difference between the results corresponds to a large saving in total cost. 

With respect to running time, the heuristic needs a few seconds while the GA 
usually finishes within several minutes. 

5 Conclusion 

In this paper, we proposed a GA to deal with the selection of materialized 
views in DW. Due to the nature of the problem, we have shown that a GA is 
particularly suitable. We have also shown the representation of our problem in 
GA and three basic operators used: selection, crossover, mutation. Based on the 
principle of a Simple GA, we defined the fitness and cost function transformation 
and developed a modified policy of selection. Finally we modified the GA to 
cater for the “invalid” solutions. We have demonstrated that a GA is a feasible 
approach to solving the materialized view selection problem. 

In this paper, we only considered a given MVPP. For a set of queries, normally 
there are lots of possible MVPPs. We have applied GA to this problem. In the 
future we will explore the possibility of using Genetic Programming to select the 
best MVPP from all the possible MVPPs and materialized views. We will also 
apply this method on a large number of nodes to test different scenarios. 





Genetic Algorithm for Materialized View Selection in Data Warehouse Environments 






References 

1. Elena Baralis, Stefano Paraboschi, and Ernest Teniente. Materialized view selection 
in a multidimensional database. Proceedings of the 23rd VLDB Conference, Athens, 
Greece, pages 156-165, 1997. 

2. D.E. Goldberg. Genetic algorithms in search, optimization and machine learning. 
Addison Wesley, Reading(MA), 1989. 

3. Himanshu Gupta. Selection of views to materialize in a data warehouse. 
Proceedings of the International Conference on Data Engineering, Birmingham, U.K., 
pages 98-112, April, 1997. 

4. Himanshu Gupta and Inderpal Singh Mumick. Selection of views to materialize 
under a maintenance cost constraint. Proceedings of the International Conference 
on Data Engineering, 1998. 

5. Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Implementing 
data cubes efficiently. ACM SIGMOD International Conference on Management 
of Data, pages 205-227, 1996. 

6. Wilburt Juan Labio, Dallan Quass, and Brad Adelberg. Physical database 
design for data warehouses. Proceedings of the International Conference on Data 
Engineering, pages 277-288, 1997. 

7. K.A. Ross, Divesh Srivastava, and S. Sudarshan. Materialized view maintenance 
and integrity constraint checking: Trading space for time. Proceedings of the ACM 
SIGMOD, pages 447-458, 1996. 

8. Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. Heuristic and 
randomized optimization for the join ordering problem. VLDB Journal, 6(3):191-208, 1997. 

9. Dimitri Theodoratos and Timos Sellis. Data warehouse configuration. Proceedings 
of the 23rd VLDB Conference, Athens, Greece, pages 126-135, 1997. 

10. Jian Yang, Kamalakar Karlapalem, and Qing Li. Algorithm for materialized view 
design in data warehousing environment. VLDB ’97, pages 20-40, 1997. 




Optimization of Sequences of Relational Queries 
in Decision-Support Environments 



Antonio Badia and Matthew Niehues 



Computer Science and Computer Engineering department 
University of Arkansas 
Fayetteville AR 72701 
E-mail: abadia@godel.uark.edu 



1 Introduction 

In this paper, we analyze collections of SQL queries which together answer a 
user’s question for which no single SQL query can compute the solution. These 
collections usually define a series of views or temporary tables that consti- 
tute partial solutions to the question, and finally use an SQL query on those 
views/tables to get the final answer. We argue that this situation poses prob- 
lems for traditional approaches to optimization. We show that many of these 
collections of queries follow some patterns and argue that the class of queries 
covered by such patterns is relevant for practical purposes. We show a way to 
implement these collections in an efficient manner. We have carried out experi- 
ments with the TPC-D benchmark in order to test our approach. 



2 The Problem and Related Work 

Some common business questions cannot be expressed with a single query in 
SQL ([1]). In this case, one can define a series of views or temporary tables that 
constitute partial solutions to the question, and finally use an SQL query on 
those views/tables to get the final answer. Thus the final result is constructed 
in a step-wise fashion. 

Example 1. In our examples we will assume a relation R with attributes A, B, 
C, D. The question “Give the smallest maximum value of C associated with a 
given B” can be answered by first creating a view as follows: 

CREATE VIEW V(B, MYMAX) AS SELECT B, max(C) FROM R GROUP BY B 

and then querying the view: 

SELECT * FROM V WHERE MYMAX = (SELECT min(MYMAX) FROM V) 
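
To make the example concrete, the two-step sequence can be run against an in-memory SQLite database from Python, next to a single pass over R that computes the same answer. This is only an illustrative sketch: the sample rows are invented, the view is written with a column alias in its select list instead of the column list V(B, MYMAX), and the function `smallest_max` is a hypothetical stand-in, not the for-loop program developed in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT, C INTEGER, D TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [(1, "x", 5, "d"), (2, "x", 9, "d"), (3, "y", 7, "d"),
     (4, "y", 3, "d"), (5, "z", 8, "d")],   # invented sample rows
)

# Step 1: the view holding the maximum of C for every B.
conn.execute("CREATE VIEW V AS SELECT B, max(C) AS MYMAX FROM R GROUP BY B")
# Step 2: the query on the view that picks the smallest maximum.
two_step = conn.execute(
    "SELECT * FROM V WHERE MYMAX = (SELECT min(MYMAX) FROM V)"
).fetchall()


def smallest_max(rows):
    """One pass over (B, C) pairs: keep a running maximum per B, then take the minimum."""
    maxima = {}
    for b, c in rows:
        if b not in maxima or c > maxima[b]:
            maxima[b] = c
    smallest = min(maxima.values())
    return sorted((b, m) for b, m in maxima.items() if m == smallest)


one_pass = smallest_max(conn.execute("SELECT B, C FROM R"))
```

Both computations return the group y with value 7 for the sample rows; the single pass reads R once and never materializes V.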

By looking at the whole sequence, it is clear that the extra work involved 
in computing view V separately is not necessary. The problem is that SQL does 
not allow for composition of aggregate functions, as each aggregate is related 
to a GROUP BY clause (as max in the example) or to a subquery (as min in the 
example). There are other reasons why SQL may be unable to compute a solution 
with one single query, such as having the aggregate computation depend on some 
condition. 

There is a large body of research in query optimization, as the problem 
has tremendous practical importance. However, optimization of groups of SQL 
queries is not a topic that has been explored as much as single query optimiza- 
tion. The recent interest in data warehousing and decision support has motivated 
work on the use of views to solve complex queries in a step-by-step fashion, 
decomposing the problem into simpler ones ([2], [4]). Most of this work is focused on 
view maintenance and does not address the specific problem we present, which is 
the inability of the language to express some queries as one-step computations. 
Some work ([6], [8]) concerns optimization of multiple queries. [6] develops the 
idea (already present in past research) of discovering common subexpressions 
among groups of queries, executing the subexpressions once and storing the re- 
sults for reuse. This approach takes as input the query plan for each query, not 
the query itself. The query plan produced may be optimal for the query, but 
suboptimal from the point of view of optimizing the whole sequence (in particu- 
lar, some common subexpressions may be gone). To overcome this, it is proposed 
to consider several query plans per query. The approach of [8] allows queries to 
be broken down into subqueries for more flexibility in finding common 
subexpressions. A heuristic approach is proposed when considering all plans, since the 
search space may become very large. The work of both [6] and [8] is limited to 
Select-Project-Join queries, without subqueries or aggregation. We note an 
important difference in coverage and goals between these approaches and ours: [6] 
and [8] try to optimize arbitrary groups of queries to improve the system’s 
throughput, while we look at particular sequences of queries in which each subquery 
represents a different step (partial solution) towards the ultimate goal. 

The work of Ross and Chatziantoniou in [7] has many similarities with the 
research developed here. [7] observes that the constraints of SQL do not allow 
the language to express many aggregate-based queries in a single query. An 
extension to SQL which would allow said queries to be expressed as a single 
query is proposed. An operator for the relational algebra is given that translates 
the SQL extension, and an algorithm to evaluate the new operator provided. 
Thus [7] is basically attacking the same problem that we deal with here. The 
algorithm proposed also seems to coincide with the for-loop programs that we 
provide as solutions to the examples, in that relations are first grouped and then 
aggregations are calculated in one pass, possibly over different groups. Thus, 
even though the solutions proposed here and in [7] are of a different nature, the 
end result seems very similar. An important difference is that the work reported 
here takes as input the original SQL and therefore requires nothing from the 
user, while [7] requires the user to rewrite the SQL query. 

3 For-Loop Optimization 

We represent SQL queries in a schematic form. With the keywords SELECT ...
FROM ... WHERE we will use L, L_1, L_2, ... as variables over lists of attributes;
T, T_1, T_2, ... as variables over lists of relations; F, F_1, F_2, ... as variables
over aggregate functions; and A, A_1, A_2, ... as variables over lists of conditions.
Attributes will be represented by attr, attr_1, attr_2, .... We will say that a
pattern matches an SQL query when there is a correspondence g between the variables
in the pattern and the elements of the query.

Example 2. The pattern

SELECT L FROM T WHERE A_1
AND attr_1 = (SELECT F(attr_2) FROM T WHERE A_2)

would match queries with embedded subqueries that contain an aggregate function
(and nothing else) in the SELECT clause. Recall relation R(A,B,C,D) and let
S(E,F,G) be another relation with E a foreign key referring to A. The SQL query

SELECT * FROM R, S WHERE R.A = S.E and R.B = 'c' and S.F = 'd'
and C = (SELECT max(C) FROM R, S
WHERE R.A = S.E and R.B = 'c' and R.D = 'e')

matches the pattern above with the correspondence
g(A_1) = {R.A = S.E, R.B = 'c', S.F = 'd'},
g(A_2) = {R.A = S.E, R.B = 'c', R.D = 'e'}, g(T) = {R, S}, g(F) = max and
g(attr_1) = g(attr_2) = C.

We use these patterns to create for-loop programs. A for-loop program is an
expression of the form

for (t in R) GROUP (t.attr, Body1) [Body2] Body3 |Body4|

where t is a tuple variable (called the driving tuple), R is a relational algebra
expression called the basic relation, and each one of Body1 ... Body4 is called a
loop body. A loop body is a sequence of statements, where each statement is
either a variable assignment or a conditional statement. We write assignments
as v := e;, where v is a variable and e an expression. Both variables and
expressions are of integer, tuple or relation type. Expressions are made
up of variables, constants, arithmetic operators (for integer variables) and the ∪
operator (for relation variables). If e_1, ..., e_n are either integer expressions or
attribute names, then (e_1, ..., e_n) is a tuple expression. If u is a tuple
expression, then {u} is a relation expression. Conditional statements are written as:
if (cond) p1; or: if (cond) p1 else p2;, with both p1 and p2 being sequences
of statements. The condition cond is made up of the usual comparison operators
(=, <, > and so on) relating constants and/or variables. Braces ({, }) are
used for clarity. Also, for-loop programs obey the following constraints: first, the
basic relation is built using only the join, project and select relational operators
applied to base relations from the database. Second, the only tuple variable in
the loop body is the driving tuple and the only relational variable is a special
variable called result. The semantics of a for-loop program are defined in an
intuitive way. Let attr be the name of an attribute in R, and let t_1, ..., t_n be
an ordering of the tuples of R in which tuples with equal attr values are adjacent,
i.e. for any i, j ∈ {1, ..., n} with i < j, if t_i.attr = t_j.attr then
t_k.attr = t_i.attr for every k between i and j (in other words, the ordering
provides a grouping of R by attribute attr). Then Body1 is executed once for each
tuple, and all variables in Body1 are reset for different values of attr (that is,
Body1 is computed independently for each group), while Body2 is executed once for
every value of attr (i.e. once for every group) after Body1 has been executed.
Body3 is simply executed once for each tuple in R, and Body4 is executed
once, after the iteration is completed.

Example 3. The SQL query

select B, avg(C) from R where A = 'a1' group by B

can be computed by the program

count := 0; sum := 0; avg := 0; result := ∅;
for (t in π_{B,C}(σ_{A='a1'}(R)))
GROUP (t.B, {sum := sum + t.C; count := count + 1})
[avg := sum/count; result := result ∪ {(t.B, avg)};]

This example has neither a Body3 nor a Body4 fragment. Observe that it is
assumed that variables sum and count get reset to their initial values for each
group, while avg and result are global variables, and the instructions that
contain them are executed only once for each group (once sum and count have been
computed).
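
The evaluation order of Example 3 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments; the function name `avg_by_group`, the representation of the basic relation as a list of (B, C) pairs, and the sample rows are assumptions made only for illustration. Sorting stands in for the grouping order required by the GROUP construct.

```python
from itertools import groupby
from operator import itemgetter


def avg_by_group(basic_relation):
    """One grouped pass over the basic relation, mirroring Example 3.

    basic_relation: iterable of (B, C) pairs (hypothetical representation of
    the projected and selected relation).
    """
    result = set()
    # Sorting makes tuples with equal B adjacent, i.e. it provides the grouping.
    ordered = sorted(basic_relation, key=itemgetter(0))
    for b, group in groupby(ordered, key=itemgetter(0)):
        # Body1: executed once per tuple; sum and count are reset per group.
        total, count = 0, 0
        for _, c in group:
            total += c
            count += 1
        # Body2: executed once per group, after Body1 has seen the whole group.
        result.add((b, total / count))
    return result


rows = [("x", 10), ("y", 4), ("x", 20), ("y", 6)]
print(avg_by_group(rows))
```

The inner loop plays the role of Body1 and the statement after it plays the role of Body2, so each tuple of the basic relation is read exactly once after the ordering step.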

In order to build a for-loop program we need to provide a basic relation and 
code for Body. This is derived from the pattern that the SQL query matches. 
The following example gives an intuitive view of the process. 

Example 4. Assume the pattern and the query of example 2. We then proceed as
follows: the basic relation is built using g(A_1) ∩ g(A_2) (in this case, {R.A = S.E,
R.B = 'c'}) applied to g(T) (in this example, {R, S}). Thus the basic relation
is σ_{R.B='c'}(R ⋈ S) (after some basic relational optimization). For the body of
the loop, we define a piece of code associated with the pattern. In this case, the
code is

max := −∞; result := ∅;
if g(A_2) − g(A_1) update aggregate (max) and update result;
if g(A_1) − g(A_2) compute result;

where result and max are global variables. The concrete code depends on the
linking operator and the linking function ("=" and max in example 2), since
update aggregate, update result and compute result are macros that must
be further developed. The concrete code explains how to compute the condition
in an iterative fashion. Once the particular code for "=" and max is added, the
pattern for the query of example 2 expands into the following program:

max := −∞; result := ∅;
for (t in σ_{R.B='c'}(R ⋈ S)) {
  if (t.D = 'e')
    if (t.C > max) { max := t.C; result := ∅; }
  if (t.F = 'd')
    if (t.C = max) result := result ∪ {t};
}

Given a question expressed as a set of queries, optimization as a group is
only possible if the set of queries is presented to the optimizer as a package, so
that the optimizer can examine them together as a unit. In that situation, it is
possible to detect the relationship among views (tables) by looking at the FROM
clauses. In some cases it may be possible to collapse several steps into a single
one.

Example 5. The query of example 1 has two parts. The first one is a view cre- 
ation query which fits the pattern of example 3. The second one is a query which 
follows the pattern of example 2. The relation is established by the use of name 
V (the view created in the first part) in the FROM clause of the second part. Both 
patterns can be combined as follows: create a new pattern following example 3, 
and insert, inside the [] part of the body, the code introduced by example 2. The 
final program for the query of example 1 is: 
result := ∅; curr-max := −∞; curr-min := +∞;
for (t in π_{B,C}(R))
GROUP (t.B,
  if (t.C > curr-max) { result := {(t.B, t.C)}; curr-max := t.C; }
  if (t.C = curr-max) { result := result ∪ {(t.B, t.C)}; })
[if (t.C < curr-min) { result := {t}; curr-min := t.C; }
 if (t.C = curr-min) { result := result ∪ {t}; }]

Recall that the code between [ and ] is done once for every group, after the code 
in the GROUP construct. Note, in particular, that curr-max is reset after every 
group computation, while curr-min is a global variable. 



4 Experimental Analysis 

To test our approach we ran experiments using queries 11, 12, 14 and 15 of
the TPC-D benchmark ([3]). These queries fit some of our patterns. We first
ran the queries in two commercial relational database systems. Then the queries
were manually transformed into for-loop programs and implemented in SQL with
cursors¹. The running times were compared; all queries showed improvements
of about 50% (i.e. ran in about half the time). It is important to point out that
the for-loop approach was implemented using the same systems and setup as
the SQL queries; therefore both benefited from the same indices, buffer space,
etc. In particular, the basic relation is computed as if it were a regular SQL
query. Therefore, it is reasonable to assume that the improvement is due to the
fact that the for-loop avoids duplication of effort by reducing the number of
intermediate results needed.

¹ The intuitive idea is to extract the basic relation from the tables in the database
first (using SQL), and store it as a temporary result. The loop program can then be
implemented in main memory using cursors, as one pass over the basic relation is
all that is needed.




Optimization of Sequences of Relational Queries in Decision-Support Environments 






5 Conclusion and Further Research 

We have introduced a new mechanism to implement a class of sequences of SQL 
queries. This class is not processed as efficiently as possible by relational pro- 
cessors. We presented the approach intuitively and discussed some experimental 
results. 

There are several issues that may influence practical usage of the approach.
One is whether the approach is useful in a sufficiently wide range of
circumstances; another is whether the approach can be integrated into existing query
processors. The first issue cannot be answered simply, as it has some empirical
aspects. We argue that, because of the SQL syntax, which forces groupings
dependent on different attributes or selections to go in different queries, the class
of queries covered here is significant. With respect to the second, we note that
the for-loop approach can be very easily expressed as an iterator and therefore
could be incorporated into an extensible query processor like the VOLCANO
system ([5]). Indeed, the for-loop takes a relation as input and produces a
relation as output, and therefore could be embedded in an iterator module (note
that the rest of the computation, the basic relation, is a standard SQL query).

This paper reports work in progress. The approach should be completely
automated and extended to deal with more cases; for instance, cases in which
a mix of base tables and views is used in a FROM clause. The exact relationship
between the present approach and that of [7] should be studied.

Abstract. Data warehouses and on-line analytical processing (OLAP) tools
have become essential elements of decision support systems. Traditionally, data 
warehouses are refreshed periodically (for example, nightly) by extracting, 
transforming, cleaning and consolidating data from several operational data 
sources. The data in the warehouse is then used to periodically generate reports, 
or to rebuild multidimensional (data cube) views of the data for on-line querying 
and analysis. Increasingly, however, we are seeing business intelligence appli- 
cations in telecommunications, electronic commerce, and other industries, that 
are characterized by very high data volumes and data flow rates, and that require 
continuous analysis and mining of the data. For such applications, rather differ- 
ent data warehousing and on-line analysis architectures are required. In this pa- 
per, we first motivate the need for a new architecture by summarizing the re- 
quirements of these applications. Then, we describe a few approaches that are 
being developed, including virtual data warehouses or enterprise portals that 
support access through views or links directly to the operational data sources. 
We discuss the relative merits of these approaches. We then focus on a dynamic 
data warehousing and OLAP architecture that we have developed and proto- 
typed at HP Labs. In this architecture, data flows continuously into a data ware- 
house, and is staged into one or more OLAP tools that are used as computation 
engines to continuously and incrementally build summary data cubes, which 
might then be stored back in the data warehouse. Analysis and data mining 
functions are performed continuously and incrementally over these summary 
cubes. Retirement policies define when to discard data from the warehouse (i.e.,
move data from the warehouse into off-line archival storage). Data at different 
levels of aggregation may have different life spans depending on how they are 
to be used for downstream analysis and data mining. The key features of the ar- 
chitecture are the following: incremental data reduction using OLAP engines to 
generate summaries and enable data mining; staging large volumes and flow 
rates of data with different life spans at different levels of aggregation; and 
scheduling operations on data depending on the type of processing to be per- 
formed and the age of the data. 

Abstract. A common optimization technique in data warehouse environments is
the use of materialized aggregates. Aggregate processing becomes complex if
partitions of aggregates or queries are materialized and reused later. Most prob- 
lematic are the implication problems regarding the restriction predicates. We 
show that in the presence of hierarchies in a multidimensional environment an 
efficient algorithm can be given to construct - or to derive - an aggregate from one 
or more overlapping materialized aggregate partitions (sel-derivability). 



1 Introduction 



In the last few years data warehousing has emerged from a mere buzzword to a funda- 
mental database technology. Today, almost every major company is deploying an inte- 
grated, historic database, the data warehouse, as a basis for multidimensional decision 
support queries. The purpose is to provide business analysts and managers with online 
analytical processing (OLAP). Besides the use of big parallel database servers, a com- 
mon optimization technique is to precompute aggregates, i.e. to use summary tables or 
materialized views (e.g. [3], [8], [9], [15]). Most of the presented algorithms are based on the
assumption that during the data warehouse loading process a pre-determined set of
aggregates is materialized and used during the analysis phase. But there is also a great 
performance potential in the dynamic reutilization of cached query results ([1], [5]). 

However, today the transparent reuse of aggregates is based on limited cases of query 
containment, i.e. the query must be contained in one certain aggregate. Since the impli- 
cation problem for query restrictions containing the six comparison operators as well as 
disjunctions and conjunctions is NP-hard [13], algorithms like [9] as well as
commercial products (e.g. [3]) are based on aggregate views defined without restric- 
tions to circumvent this problem. Using this approach, the definition and reuse of aggre- 
gate partitions for hot spots, like the current month or the most important product 
group, and the reuse of queries are impossible. 



In many cases this is too restrictive. Consider the query "Give me the total sales
for the video product families by region in Germany" and the tabular result
illustrated in figure 1. The materialized query represents a partition of an
aggregated data cube. If there were two redundant aggregates in the database, one
containing the sales for camcorders and the other one containing the sales for home
VCRs, then the query could be computed by the union of these aggregates. The goal of
this article is to provide a constructive solution for this problem.

Sum(Sales)   Camcorder   HomeVCR
G-East       12          37
G-West       22          32

Fig. 1. A partition of an aggregated data cube.

In the presence of a set of materialized aggregate partitions, a multidimensional
query optimizer has to determine under which circumstances and how a query can be
computed from these aggregates (figure 2). Our approach is based on
multidimensional objects, which were initially presented in [10]. Multidimensional
objects provide the information to a multidimensional query optimizer for the
transparent reuse of materialized aggregate partitions. Their definition includes
semantic information about their genesis, i.e. the applied aggregation and the
selection predicates. Thus, a certain class of aggregation queries can be directly
translated into multidimensional objects. Queries involving composite aggregations
can at least utilize multidimensional objects based on the component aggregates.

Structure of the Paper. The next section covers related work. The determination
of derivability is based on the dimensional data structures presented in section 3. Section 4
introduces multidimensional objects and some basic operators. The derivability of mul- 
tidimensional objects with a focus on the solution to implication problems in the pres- 
ence of hierarchies is covered in section 5. The article closes with a short summary. 




Fig. 2. Derivability problem: Can the query be computed from the set of
multidimensional objects?



2 Related Work 

The general idea of precomputing summary data appeared already in [4] . In the last few 
years it became very popular with the emergence of data warehousing and OLAP and 
the resulting need for an efficient and mostly read-only access to aggregates in a multi- 
dimensional context. Several articles deal with the selection and use of materialized 
views (e.g. [5], [8], [9]; see overview in [15]). In contrast to our approach, these articles 
are not able to construct a new query from a set of materialized queries but are limited 
to certain cases of query containment. 

Summarizability and derivability are terms describing under which circumstances sum- 
mary data can be derived from other summary data. [4] investigates conditions under 
which already aggregated cells might be further aggregated. Aggregation functions are 
classified as additive and computed. These notions correspond to the distinction between
distributive and algebraic functions in [7]. The question under which circumstances a query is
derivable from one or more other queries has been studied for a long time ([6], [13]). 
For summary data, disjointness and completeness are fundamental [4]. Another seman- 
tic condition, type compatibility, was identified by [11]. 

3 Dimensional Data Structures

The notion of a dimension provides a lot of semantic information especially about the 
hierarchical relationships between its elements like product groups or geographic 
regions. This information is heavily used for both aggregate queries and selections, and 
it provides the basis for the definition of multidimensional objects. 

Definition 1: A dimensional schema is a partially ordered set of dimensional attributes
(𝒟 ∪ {Total_𝒟}; →) where 𝒟 = {D_1, ..., D_n}. Total_𝒟 is a generic element which is maximal
with respect to →, i.e. D_i → Total_𝒟 for each D_i ∈ 𝒟.

An attribute D_j is called a direct parent of D_i, denoted as D_i ⇢ D_j, if D_i → D_j and there
is no D_k with D_i → D_k → D_j.

Figure 3 shows examples for dimensional schemas illustrated as directed acyclic graphs
according to the partial order →, which denotes a functional dependency, i.e. a 1:n
relationship. Total is generic in the sense that it is not modeled explicitly.

Definition 2: The instances c ∈ dom(D_i) of some dimensional attribute D_i ∈ 𝒟 are called
classification objects or classes of D_i. D_i is called the level of c.

Moreover, dom(Total_𝒟) := {'ALL'}.

An instance of a dimension 𝒟 is the set of all classes c ∈ ∪_i dom(D_i).

A hierarchy can be specified by a categorization, i.e. a path to Total in a dimension. By
defining dom(Total) := {'ALL'} it is guaranteed that all classification hierarchies are trees
having 'ALL' as the single root node. A sample classification hierarchy for the
categorization Article → Family → Group → Area → Total is shown in figure 4. The edges in such a
tree can also be seen as a mapping from the descendants to the ancestors.

Definition 3: Let D_i, D_j ∈ 𝒟 such that D_i → D_j. A class a ∈ dom(D_j) is called ancestor of class
b ∈ dom(D_i), denoted as ancestor(a,b), if and only if b maps to a according to the
functional dependency D_i → D_j. In this case b is called a descendant of a, i.e.
descendant(b,a) ⇔ ancestor(a,b).

The domain of a class a with respect to the dimensional attribute D_i is defined as the
set of its descendants, i.e. dom(a | D_i) = {b ∈ dom(D_i): descendant(b,a)}.



[Figure: three directed acyclic graphs, one each for the Product, Location and Time dimensions.]

Fig. 3. Illustration of the dimensional schemas for the product, location and time dimensions as
directed acyclic graphs.

4 Multidimensional Data Structures



The following definition of multidimensional objects is an extension of the work pre- 
sented in [10]. In contrast to other multidimensional data structures (see [12]), multidi- 
mensional objects contain additional information besides the measures and aggregation 
level (granularity) which is necessary to check the derivability. In particular, instead
of treating measures as simple attributes, the information about the aggregation
operation which was applied to the measure is made part of their definition.

Definition 4: Let Ω be a set of additive aggregation functions. A measure is a tuple M =
(N, O), where N is a name for the corresponding fact and O ∈ Ω ∪ {NONE, COMPOSITE}
is the operation type applied to that specific fact.

We assume that a measure M has a numerical domain dom(M) ∈ {ℝ, ℕ, ℚ, ℤ} and Ω =
{SUM, COUNT, MIN, MAX}. Only additive operations (in the sense of [4]) are explicitly
represented. Other operations are subsumed by the operation type COMPOSITE, i.e.
those measures cannot be used for the automatic derivation of higher aggregates.
However, for many composite operations, like AVG, one can extend our concept by implicitly
storing SUM and COUNT. The value NONE states that a measure is not aggregated.

Definition 5: A multidimensional object over the dimensions 𝒟_1, ..., 𝒟_d is a triple
ℳ = [M, G, S] where

• M = (M_1, ..., M_m) = ((N_1, O_1), ..., (N_m, O_m)) is a set of measures¹

• G = (G_1, ..., G_n) is the granularity specification consisting of a set of dimensional
  attributes, i.e. G_i ∈ 𝒟_1 ∪ ... ∪ 𝒟_d, such that for each G_i, G_j: G_i ↛ G_j

• S is a logical predicate denoting the scope.

The scope is a restriction predicate describing which data cells have been aggregated in
this particular (sub-) cube. It may include any propositional logic expression involving
the granularity attributes of ℳ and any dimensional attribute that is functionally
dependent on some G_i ∈ G. For example, the multidimensional object in figure 4 is²

[ (Sales, SUM), (P.Family, L.City, T.Month), (P.Area = 'Brown Goods' ∧ L.Country = 'Germany') ]

[Figure: classification hierarchy of the product dimension with the levels Total, Area,
Group, Family and Article (classes shown include Brown Goods; Video, Audio, Computers;
HomeVCR, Camcorder), next to a multidimensional object over Products, Location and
Month holding SUM(Sales).]

Fig. 4. A classification hierarchy for the product dimension and a multidimensional object.

¹ The definition of M and G as tuples is only for the sake of simplicity; the order of the elements
does not matter. Therefore, we will also apply the set operators like ∈, ∪, ∩, = to M and G.

² In the following examples we will abbreviate the dimension names Product, Location and Time
with P, L and T, respectively.

In the literature on multidimensional data models several operators were defined on
multidimensional data cubes [12]. Most important are selections and aggregations. The
goal of this section is to define the influence of these operators on multidimensional
objects, especially on the measures and the scope. Since the definition of operators is
not the topic of this article, we will only briefly mention some operators.

Fundamental are the metric projection of the attributes M' = (M_1', ..., M_k') ⊆ M, written as
π_{M'}(ℳ) = [M', G, S], and the restriction by a predicate P, defined as σ_P(ℳ) = [M, G, S ∧ P].
On set-compatible multidimensional objects ℳ and ℳ' (i.e. M = M' and G = G') one can
define the common set operations ℳ ∪ ℳ' = [M, G, S ∨ S'], ℳ ∩ ℳ' = [M, G, S ∧ S'], and
ℳ \ ℳ' = [M, G, S ∧ ¬S']. However, the most important operations on multidimensional
objects are aggregations, which are defined first on the measures alone and then on
multidimensional objects.

Definition 6: The application of an aggregation function F to a measure M = (N, O)
results in a measure F(M) = (N, O') where

• O' = F if O = NONE or if F = O and O ∈ {SUM, MIN, MAX},

• O' = COUNT if O = COUNT and F = SUM,

• O' = COMPOSITE otherwise.
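
The three cases of Definition 6 can be written down as a small Python function. This is only a sketch: the function name `aggregate_measure` and the encoding of operation types as strings are assumptions, and mapping a non-additive function such as AVG to COMPOSITE even when O = NONE is an interpretation based on the remark after Definition 4, not something Definition 6 states explicitly.

```python
ADDITIVE = {"SUM", "COUNT", "MIN", "MAX"}  # the set of additive functions from Definition 4


def aggregate_measure(f, measure):
    """Return the measure F(M) = (N, O') for an aggregation function name f."""
    name, op = measure
    if f not in ADDITIVE:
        # Assumption: non-additive functions (e.g. AVG) always yield COMPOSITE.
        return (name, "COMPOSITE")
    if op == "NONE" or (f == op and op in {"SUM", "MIN", "MAX"}):
        return (name, f)
    if op == "COUNT" and f == "SUM":
        return (name, "COUNT")
    return (name, "COMPOSITE")


print(aggregate_measure("SUM", ("Sales", "SUM")))
print(aggregate_measure("AVG", ("Sales", "SUM")))
```

The second case reflects that summing partial counts again yields a count, which is why COUNT is handled separately from SUM, MIN and MAX.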

A granularity specification G = (G_1, ..., G_n) is finer than or equal to a granularity
specification G' = (G_1', ..., G_k'), denoted as G ≤ G', if and only if for each G_j' ∈ G' there is a G_i ∈ G such
that G_i → G_j'. For example (P.Article, L.City) ≤ (P.Group, L.Region) ≤ (P.Area).
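
The granularity comparison can be checked mechanically once the partial order on the dimensional attributes is known. The following Python sketch assumes a hypothetical schema in which every attribute has exactly one direct parent (the paper allows general directed acyclic graphs) and in which the location levels are ordered City, Region, Country; the names `PARENT`, `determines` and `finer_or_equal` are illustrative only.

```python
# Hypothetical direct-parent edges; the product path follows the categorization
# Article -> Family -> Group -> Area -> Total, the location path is assumed.
PARENT = {
    "P.Article": "P.Family", "P.Family": "P.Group",
    "P.Group": "P.Area", "P.Area": "P.Total",
    "L.City": "L.Region", "L.Region": "L.Country", "L.Country": "L.Total",
}


def determines(lower, upper):
    """True if lower -> upper, i.e. upper equals lower or lies above it."""
    while lower is not None:
        if lower == upper:
            return True
        lower = PARENT.get(lower)
    return False


def finer_or_equal(g, g_prime):
    """G <= G': every attribute of G' is determined by some attribute of G."""
    return all(any(determines(gi, gj) for gi in g) for gj in g_prime)


print(finer_or_equal(("P.Article", "L.City"), ("P.Group", "L.Region")))
print(finer_or_equal(("P.Area",), ("P.Group", "L.Region")))
```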

Definition 7: The aggregation of a multidimensional object ℳ by a family of aggregate
functions Φ = (F_1, ..., F_m) to the granularity G' ≥ G is defined as:

Φ(G', ℳ) = [(F_1(M_1), ..., F_m(M_m)), G', S]

For example, if ℳ = [ (Sales, SUM), (P.Family), (P.Group = 'Video') ] then

(SUM) ( (P.Group), ℳ ) = [ (Sales, SUM), (P.Group), (P.Group = 'Video') ] and
(AVG) ( (P.Group), ℳ ) = [ (Sales, COMPOSITE), (P.Group), (P.Group = 'Video') ].



5 Derivability in the Presence of Hierarchies 

Based on the definitions of the last section, we will now define under which conditions 
and how a multidimensional object can be computed from a set of materialized MOs. A 
necessary prerequisite to derive a multidimensional object is that the aggregation level 
of the original MOs is finer than the granularity of the derived MO. This condition 
directly corresponds to the relationship of the aggregates in an aggregation lattice [9]. 
Two further conditions, measure compatibility and reconstructibility, are necessary to 
define the derivability of multidimensional objects. 

Definition 8: A multidimensional object ℳ = [M, G, S] is derivable from a
multidimensional object ℳ' = [M', G', S'] if and only if

• for each measure M_i ∈ M there is M_j' ∈ M' such that N_i = N_j' and O_i = O_j' or O_j' = NONE

• the granularity specification of ℳ' is finer than that of ℳ, i.e. G' ≤ G

• S is contained in S', i.e. S ⊆ S' (or S → S') and S is reconstructible from S'.

Measure compatibility simply means that, for example, total sales are derivable from
total sales. Most problematic is the third condition, one part of which is that S is
contained in S' (considering S and S' as sets of dimensional elements). How to check this
condition and the notion of reconstructibility are explained in the next section.



5.1 Scope Normalization 

In [13] it is shown that the problem of determining whether some predicate S is implied by
another predicate S' is NP-complete if the predicates may contain disjunctions and
inequalities besides simple comparison operators. In this section we will give an efficient
polynomial time algorithm, which solves implication problems in the presence of hier- 
archies on finite domains even for negations and disjunctions. The algorithm is based 
on compact scopes for which the determination of scope containment is very simple. 

Definition 9: A scope S is compact if S is a conjunction of positive terms and there are
no two different terms 𝒟.D_j = c and 𝒟.D_k = c' with ancestor(c, c').

Thus, the scope ((P.Family = 'Camcorder' ∨ P.Family = 'HomeVCR') ∧ L.Country = 'Germany') is
not compact, but (P.Group = 'Video' ∧ L.Country = 'Germany') is.

A compact scope S is contained in a scope S' (denoted as S → S' or S ⊆ S') if and only if
for each term 𝒟.D_k = c' in S' there exists a term 𝒟.D_j = c in S such that ancestor(c', c). For
example, (P.Family = 'HomeVCR') ⊄ (P.Group = 'Video' ∧ L.Country = 'Germany') but
(P.Family = 'HomeVCR' ∧ L.Country = 'Germany') ⊆ (P.Group = 'Video').
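
The containment test for compact scopes only needs the ancestor relation between classes. The following Python sketch is a simplified illustration rather than the authors' implementation: a compact scope is encoded as a set of class names (read as a conjunction), class names are assumed to be unique across levels, `CLASS_PARENT` is a hypothetical fragment of the classification hierarchies used in the examples, and a class is treated as its own ancestor so that identical terms match.

```python
# Hypothetical fragment of the product and location classification hierarchies.
CLASS_PARENT = {
    "HomeVCR": "Video", "Camcorder": "Video", "Video": "Brown Goods",
    "G-East": "Germany", "G-West": "Germany",
}


def ancestor(a, b):
    """True if class a equals b or lies above b in the hierarchy (assumption: reflexive)."""
    while b is not None:
        if a == b:
            return True
        b = CLASS_PARENT.get(b)
    return False


def contained(s, s_prime):
    """Compact scope s is contained in s_prime: each term of s_prime covers a term of s."""
    return all(any(ancestor(c_prime, c) for c in s) for c_prime in s_prime)


print(contained({"HomeVCR"}, {"Video", "Germany"}))
print(contained({"HomeVCR", "Germany"}, {"Video"}))
```

The two calls reproduce the two examples given above: the first scope is not contained because nothing in it is restricted to Germany, the second one is.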

Not only scope containment, but also all problems of finding the intersections or
differences of two compact scopes can be solved simply by determining ancestor/descendant
relationships of classes appearing in the conjunctive clauses. Both operations are based
on the one-dimensional intersection and the difference of two classes. For the
intersection of two classes c ∈ dom(D_i) and c' ∈ dom(D_j), c ∩ c' = c holds if ancestor(c', c), and c ∩ c' = ∅
otherwise. For example, in figure 5 M ∩ B = B and M ∩ E = ∅. Intersections of classes in
parallel hierarchies like P.Family = 'HomeVCR' ∧ P.Brand = 'Sony' are not resolved but treated as
if they belonged to separate dimensions. The difference of two classes can be computed by the
algorithm ClassDifference as illustrated in figure 5 (see [2] for the complete algorithm).




Naive difference "R" − "A" ⇔
D_1="B" ∨ D_1="C" ∨ D_1="D" ∨ D_1="E" ∨ D_1="F" ∨ D_1="G" ∨ D_1="H"

Smart difference for "R" − "A" ⇔
D_1="B" ∨ D_2="J" ∨ D_3="N"

Fig. 5. Illustration of the algorithm ClassDifference.

Each scope can be transformed into a “minimal” disjunction of mutually disjoint com- 
pact scopes, the disjunctive scope normal form (DSNF). Based on the DSNF and the 
scope difference the scope implication problems for non-compact scopes can be solved 
in a constructive way. To explain the construction of the DSNF consider the following 
multidimensional object: 

[(Sales, SUM),
 (P.Group, L.Region),
 (P.Group = 'Video' ∧ L.Region = 'G-East') ∨
 ((L.Region = 'G-West' ∨ L.Region = 'G-East') ∧
  P.Group = 'Video' ∧ P.Family ≠ 'Camcorder')]

Intuitively, the scope definition is not minimal, because the terms
L.Region = 'G-East' ∨ L.Region = 'G-West' can be reduced to L.Country = 'Germany'. In order to find all
such terms, it is necessary to translate the predicate into conjunctive normal form where
such terms appear in a single disjunctive clause and can be discovered easily. This kind
of reduction together with the replacement of negative terms is realized by algorithm 1
(see [2] for details), which constructs the conjunctive scope normal form (CSNF), i.e. a
minimal expression of the scope in CNF. The resulting scope for the example above is

(P.Group = 'Video' ∨ L.Country = 'Germany') ∧ (P.Group = 'Video') ∧
(L.Country = 'Germany') ∧ (L.Region = 'G-East' ∨ P.Family = 'HomeVCR')

The translation from CSNF into DSNF is analogous to translating CNF into DNF. This
implies that all positive terms remain positive. The example yields the following DNF:

(P.Group = 'Video' ∧ L.Country = 'Germany' ∧ L.Region = 'G-East') ∨
(P.Group = 'Video' ∧ P.Family = 'HomeVCR' ∧ L.Country = 'Germany')

To make the clauses compact, for each class it must now be checked if an ancestor is
also in the same clause. If so, the ancestor is removed. This leads to

(P.Group = 'Video' ∧ L.Region = 'G-East') ∨ (P.Family = 'HomeVCR' ∧ L.Country = 'Germany')
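
The last step, making each conjunctive clause compact, can be sketched in a few lines of Python. As before, this is an illustration under assumptions: clauses are encoded as sets of class names, `CLASS_PARENT` is a hypothetical fragment of the hierarchies from the running example, and the helper names are not taken from the paper.

```python
# Hypothetical fragment of the classification hierarchies of the running example.
CLASS_PARENT = {
    "HomeVCR": "Video", "Camcorder": "Video", "Video": "Brown Goods",
    "G-East": "Germany", "G-West": "Germany",
}


def is_proper_ancestor(a, b):
    """True if class a lies strictly above class b in the hierarchy."""
    b = CLASS_PARENT.get(b)
    while b is not None:
        if a == b:
            return True
        b = CLASS_PARENT.get(b)
    return False


def compact(clause):
    """Drop every class that has a descendant in the same conjunctive clause."""
    return {c for c in clause if not any(is_proper_ancestor(c, d) for d in clause)}


dnf = [{"Video", "Germany", "G-East"}, {"Video", "HomeVCR", "Germany"}]
print([compact(clause) for clause in dnf])
```

Applied to the two clauses of the DNF above, the sketch removes Germany from the first clause and Video from the second, which matches the compact clauses given in the text.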



Sum(Sales)   Camcorder   HomeVCR
G-East       12          37
G-West                   32

Algorithm: ConjunctiveScopeNormalization

Input: Scope of a MO over dimensions 𝒟_1, ..., 𝒟_n
       in conjunctive normal form S = S_1 ∧ ... ∧ S_n

Output: Scope S in conjunctive scope normal form

Begin
  Foreach S_i
    replace all negative terms 𝒟_j.D_k ≠ c by
      ClassDifference(𝒟_j.Total = "ALL", 𝒟_j.D_k = c);

    Foreach term 𝒟_j.D_k = c
      If (c' with Ancestor(c', c) is also contained)
        remove 𝒟_j.D_k = c

    Foreach term 𝒟_j.D_k = c
      let D_p represent a direct parent level,
        i.e. 𝒟_j.D_k ⇢ D_p;
      p = GetAncestor(c | D_p);
      If (all elements of dom(p | D_k) are in S_i)
        replace c and all siblings by p;

  End Foreach

  Return S = S_1 ∧ ... ∧ S_n;
End

Algorithm 1: ConjunctiveScopeNormalization transforms a scope from conjunctive normal
form to conjunctive scope normal form.



Algorithm: PatchWork

Input:  A compact scope SC and
        a scope in DSNF S = SC_1 ∪ ... ∪ SC_n
Output: TRUE if S ⊇ SC, FALSE otherwise

 1 Begin
 2   remainder = {SC};
 3   solution = ∅;
 4
 5   While remainder ≠ ∅ Do
 6     Foreach R ∈ remainder
 7       found = FALSE;
 8       For i = 1 To n
 9         // check if this part of the remainder is
10         // intersected by a compact scope in S
11         If Intersection(R, SC_i) ≠ ∅ Then
12           remainder = remainder \ {R} ∪
13                       ScopeDifference(R, SC_i);
14           solution = solution ∪ Intersection(R, SC_i);
15           found = TRUE;
16           Break;
17         End If
18       End For
19       If (Not found)
20         Return ∅;
21     End Foreach
22   End While
23   Return solution;
24 End

Algorithm 2: PatchWork constructs a solution in DSNF that describes how to compute the
compact scope SC from SC_1, ..., SC_n.

By using the ScopeDifference it is now possible to make the clauses mutually disjoint.
Therefore, a DSNF representation of the scope is

(P.Group = ‘Video’ ∧ L.Region = ‘G-East’) ∨ (P.Family = ‘HomeVCR’ ∧ L.Region = ‘G-West’)

Based on the DSNF, the problem whether a scope S_1 = SC_11 ∨ ... ∨ SC_1p is contained in a
scope S_2 = SC_21 ∨ ... ∨ SC_2m can be solved constructively by algorithm 2. For each
conjunctive term SC_1i the following steps are executed. The remainder is a set of
“patches”, i.e. “compact” fragments which still must be covered. For each remaining patch R
it must be checked if R ∩ SC_2i ≠ ∅ for some i (line 11). If so, the intersection is removed
from the remainder and added to the solution (lines 12-14). If for some remaining patch no
intersecting clause from S_2 is found, then S_1 ⊄ S_2. Thus, the solution itself is a set of
mutually disjoint compact scopes (patches) and can be seen as an instruction how to compute
SC_1i from S_2.
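
The patch-working idea can be illustrated with a small Python sketch. It is not the
authors' implementation: as a simplifying assumption, every compact scope is modelled as a
plain set of cells (here hypothetical `(product group, region)` pairs), so that
`Intersection` and `ScopeDifference` become ordinary set operations and the difference
always yields a single patch.

```python
def patchwork(target, materialized):
    """Cover the target scope with disjoint patches cut from materialized scopes.

    Returns a list of (index, patch) pairs, or None if some part of the
    target is not covered by any materialized scope.
    """
    remainder = [frozenset(target)] if target else []
    solution = []
    while remainder:
        patch = remainder.pop()
        for index, scope in enumerate(materialized):
            overlap = patch & scope          # Intersection(R, SC_i)
            if overlap:
                rest = patch - scope         # ScopeDifference(R, SC_i)
                if rest:
                    remainder.append(rest)   # still to be covered
                solution.append((index, overlap))
                break
        else:
            return None                      # no scope intersects this patch
    return solution


# Hypothetical example scopes, modelled as sets of cells.
video = {("Video", "G-East"), ("Video", "G-West")}
audio = {("Audio", "G-East"), ("Audio", "G-West")}
plan = patchwork(video | audio, [video, audio])
```

In a real system the difference of two compact scopes generally decomposes into several
compact patches; the set-of-cells model hides this detail but keeps the control flow of
algorithm 2 visible.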

However, this construction does not work in all cases, because a multidimensional object
may still not be reconstructible from the other one. Consider the query “Give me the
total sales of all video and audio products per region”, expressed by the multidimensional
object

ℳ = [(Sales, SUM), (L.Region), (P.Group = ‘Video’ ∨ P.Group = ‘Audio’)].

If there was an aggregate with no restrictions (equivalent to Total = “ALL” in all
dimensions)

ℳ′ = [(Sales, SUM), (P.Area, L.Region), ()],

then the question is whether ℳ is derivable from ℳ′. It turns out that, although G′ ≤ G and
S ⊆ S′, ℳ is not reconstructible from ℳ′, because the two patches with P.Group = ‘Video’
and P.Group = ‘Audio’ cannot be addressed in ℳ′. The reason is that ℳ′ already has a higher
(coarser) granularity (P.Area) than the attributes in the patch clauses (P.Group). This must
also be checked (see definition 8). However, in case S = S′ it would work anyway.
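
The additional check described above can be sketched as a simple level comparison. The
ordering of the product levels below is assumed from the attributes that appear in the
examples (P.Article, P.Family, P.Group, P.Area); the function name is hypothetical.

```python
# Levels of the product dimension, ordered from finest to coarsest
# (an assumption based on the attributes used in the examples).
P_LEVELS = ["Article", "Family", "Group", "Area"]


def is_addressable(restriction_level, source_granularity, levels=P_LEVELS):
    """A patch restricted on `restriction_level` can only be cut out of a
    source MO whose granularity is at least as fine as that level."""
    return levels.index(source_granularity) <= levels.index(restriction_level)
```

With these assumptions, a patch on P.Group cannot be addressed in an aggregate stored at
P.Area, whereas it can be addressed in one stored at P.Family.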

5.2 Set-Derivability 

A set of multidimensional objects { [M, G, SC_1], ..., [M, G, SC_p] } can also be seen as a
MO ℳ = [M, G, SC_1 ∨ ... ∨ SC_p] and the other way around. Since it is easy to aggregate MOs
at a finer granularity G′ < G (definitions and 8) to G, one can easily extend algorithm 2 to
construct one multidimensional object ℳ_q from a set of multidimensional objects
ℳ_1, ..., ℳ_p at granularities G′ < G (figure 6). Thus, ℳ_q is set-derivable from
ℳ_1, ..., ℳ_p if algorithm 3 yields a non-empty solution. Set-derivability in conjunction
with a cost-based selection can serve as a basis to compute one query from a set of
previously materialized queries. For an illustration consider the following
multidimensional objects in DSNF:

ℳ_1 = [ (Sales, SUM), (P.Family, L.Region), (P.Group = ‘Video’ ∧ L.Country = ‘Germany’) ]

ℳ_2 = [ (Sales, SUM), (P.Group, L.City), (P.Area = ‘Brown Goods’ ∧ L.Region = ‘G-West’) ]

ℳ_3 = [ (Sales, SUM), (P.Article, L.City), ((P.Group = ‘Audio’ ∧ L.Country = ‘Germany’) ∨
        (P.Group = ‘Video’ ∧ L.Region = ‘G-West’)) ]

Algorithm 2 can be used in the context of set-derivability to derive the query

ℳ_q = [ (Sales, SUM), (P.Group, L.Region), (P.Area = ‘Brown Goods’ ∧ L.Country = ‘Germany’) ]

from ℳ_2 and a further materialized MO by the query execution plan depicted in figure 6.
Such an approach overcomes the limitations of query containment for the reuse of cached
aggregates and bears a high performance potential [1].

Fig. 6. Patch-working. The requested MO ℳ_q can be constructed from the materialized MOs.
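
Before patches from differently grained MOs can be combined, each of them has to be
aggregated to the requested granularity. A minimal sketch of such a roll-up for the SUM
operator follows; the mapping `FAMILY_TO_GROUP` and the cell values are illustrative
assumptions in the spirit of the running example, not data from the paper's experiments.

```python
from collections import defaultdict

# Hypothetical classification of product families into product groups.
FAMILY_TO_GROUP = {"Camcorder": "Video", "HomeVCR": "Video"}


def roll_up(cells, parent_of):
    """Aggregate SUM cells keyed by (family, region) to (group, region)."""
    totals = defaultdict(int)
    for (family, region), sales in cells.items():
        totals[(parent_of[family], region)] += sales
    return dict(totals)


# Illustrative cells at granularity (P.Family, L.Region).
fine = {
    ("Camcorder", "G-East"): 12,
    ("HomeVCR", "G-East"): 37,
    ("HomeVCR", "G-West"): 32,
}
coarse = roll_up(fine, FAMILY_TO_GROUP)
```

The sketch relies on SUM being additive; other aggregation functions would need their own
compatibility check, as discussed for the measure condition of derivability.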



6 Summary and Future Work 

The determination of the derivability of multidimensional aggregates is an essential task
for a multidimensional query optimizer. In this article we presented multidimensional
objects as an enriched data structure which helps to accomplish this task. We have
shown that the inclusion of the semantics of aggregations on the measures in an
extended multidimensional algebra allows much more flexibility for the selection of
aggregates by the query optimizer than is possible today. For the derivability of
multidimensional objects three conditions have to be checked: measure and granularity
compatibility as well as scope containment. An efficient algorithm was given which solves
the scope implication problem in the presence of hierarchies in a constructive way. The
potential of the approach has already been demonstrated for a certain class of
multidimensional objects. Experimental results are given in [1].

Future research aims at an extension of the presented concept to a more complete
multidimensional algebra, including other aggregation functions and also binary
operations. Another idea is to include comparisons on the aggregated measure attributes in
the scope restriction, a problem that has already been investigated in [14]. Our strategic
goal is to supply the query optimizer with sufficient knowledge to solve problems of the
following kind: “Given a formula Turnover = Sales * Price and an aggregated sales data
cube, under which circumstances is it possible to use this aggregated sales cube to
derive an aggregated turnover cube?” There are many possibilities under which the
information about computed measures can be used to utilize materialized multidimensional
objects for the actual computation. In several relevant cases binary operations do
not change the operation type of the resulting measure, for example,
(Stock, SUM) = (StockReceipt, SUM) - (Sales, SUM). In such cases the total Stock can be
computed from the total StockReceipt minus the total Sales.



Abstract. This paper presents a method for extracting the real dimension of a large data
set in a high-dimensional data cube and indicates its use for visual data mining. A
similarity measure structures a data set in a general, but weak sense. If the elements are
part of a high-dimensional host space (primary space), for instance a data warehouse cube,
the resulting structure does not necessarily reflect the real dimension of the
embedded (secondary) space. Mapping the set into the secondary space of lower dimension
will not result in a loss of information with regard to the semantics defined by the
measure. However, it helps to reduce storage and computing efforts. Additionally, the
secondary space itself reveals much about the set’s structure and can facilitate data
mining. We make a proposal for adding the property of a dimension to a metric and show
how to determine the real (in general fractal) dimension of the underlying data set.



1 Introduction 

Nick Roussopoulos [16] recently pointed out the multi-faceted form of views:
“What is a relational view? Is it a program? Is it data? Is it an index? Is it an
OLAP aggregate?” Here we propose to add another facet to the list: Clustering
a data set by means of a metric results in a (materialized) view for representing
the structure of the set with respect to the given metric.

Initial suggestions reach back to the design of algorithms for geometric data structures,
and an early attempt was the Bisector Tree of Kalantari and McDonald [9]. Those were
generalized by a hierarchical data structure, the so-called Metric Trees, where each node
represents a data cluster and is described by a representative (existing or artificial)
element (called the center of the cluster) and a cluster radius defined by the maximum
distance between the center and a cluster element. Whenever the current cluster is too
coarse (i.e. its radius is too large) it is recursively refined into sub-clusters,
represented by sub-trees. Thus, the cluster radii decrease on a path from the root down to
a leaf. In doing so, the metric itself is used as a black box. This mechanism was first
published in [13] and independently in [18]. Within this data structure, navigation over
data spaces with aggregation is supported, as it is known from NF²-tables [21]. In
the context of data warehousing and multimedia, those ideas resurfaced [6], [5],
[2], [22], [3].

Intensive theoretical and experimental studies were undertaken to establish a
proper choice of cluster centers in different metric spaces [14] of different application
areas (for instance convex distance functions in $\mathbb{R}^d$, the Waterman distance
measure in sequence spaces, the Mahalanobis distance, the Canberra metric, the distance
measures of Sokal & Sneath and Czekanowski & Dice in non-discrete spaces, and
the measures of Maron, Kulczynski, and Kendall in discrete spaces), in order to
minimize distance computing by means of a chain estimation based on the triangulation
inequality [20], and to develop a dynamic paging concept [19]. However,
the more recent papers that appeared in the context of data warehouses did not draw
much from these experiences.

Here, we focus on experimental results on a special type of Metric Tree,
called the Monotonic Bisector Tree, which revealed that the development of the
cluster radii corresponds directly to the real dimension of the data set [19]. For
instance, the cluster radii of points distributed on a one-dimensional curve in
$\mathbb{R}^d$ decreased by a smaller factor $q$ than those scattered arbitrarily in a
2-dimensional manifold. At the same time $q$ is typical for sets of the same dimension
under the same $L_p$-metric and is independent of the dimension $d$ of the host space. The
main intention of this paper is to add theory to this observation and to prove that the
decrease of the cluster radii is an indicator for the natural (in general fractal)
dimension of a data set, which is independent of the dimension of the host space.
Thus, roughly speaking, a special type of Metric Tree can reveal the
fractal dimension of a large data set in a warehouse cube of high dimension.

The remainder of this paper is organized as follows. After giving some formal
definitions and properties in Section 2, we develop the basic idea in Section 3
and briefly review Monotonic Bisector Trees. Section 4 proves the strong relationship
between the fractal dimension of a data set and its Monotonic Bisector
Tree. Section 5 makes a proposal for applying the theory in visual data mining
and Section 6 closes with a summary and an outlook on future activities. This
paper omits the proofs; they can be found in the full version of this paper
(http://www.db.informatik.uni-kassel.de/~czi/paper.ps).

2 Metric Spaces and Fractal Dimension 

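Let a quasi-metric space $(M, d)$ be composed of a non-empty data set $M$ and a
similarity measure $d: M \times M \to \mathbb{R}_{\geq 0}$ enhanced with the following
properties:

1. $\exists C_1 \geq 1\ \forall a, b \in M:\ d(a,b) \leq C_1\, d(b,a)$ (Quasi-Symmetry)

2. $\exists C_2 \geq 1\ \forall a, b, c \in M:\ d(a,c) \leq C_2\,[d(a,b) + d(b,c)]$
   (Quasi-Triangulation Inequality)

3. $\exists l \in \mathbb{N}\ \forall a \in M\ \forall r > 0\ \exists \{a_1, \ldots, a_l\} \subset M:\ K_r(a) \subseteq \bigcup_{1 \leq i \leq l} K_{r/2}(a_i)$ (Covering)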

Thereby, for $z \in M$ and $r > 0$ let $K_r(z) := \{x \in M \mid d(z, x) \leq r\}$ denote
the ball with radius $r$, centered at $z$. In order to estimate the runtime of our
algorithms, we additionally demand

4. $\forall a, b \in M$: $d(a, b)$ can be computed in constant time (efficiently computable)

Using $C_1 = C_2 = 1$ provides a real metric. The quasi-properties in 1 and 2
are only for technical reasons and will be used to quantify the robustness of our
algorithms with respect to skewed data or a distortion of the distance measure.
For this reason we define:

$$\varepsilon := \frac{1}{C_2\,(C_1 + 1)}$$

In the metric case, $\varepsilon = \frac{1}{2}$ holds. We assume $C_1, C_2$ to be close to
1. Please note that $d(a, a) = 0$ is not explicitly demanded since a quasi-non-negativity

$$\forall a, b \in M:\ d(a,a) \leq C_2\,(d(a,b) + d(b,a)) \leq C_2\,(d(a,b) + C_1\, d(a,b)) = \varepsilon^{-1}\, d(a,b)$$

is implied by 1 and 2, depending on the density of the underlying data set.
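
For experiments with a new distance function it can be useful to estimate the constants
$C_1$, $C_2$ and the resulting $\varepsilon$ on a finite sample. The following brute-force
sketch is only an illustration (the function name is hypothetical, and the cubic running
time restricts it to small samples); it returns the smallest constants that satisfy
properties 1 and 2 on the given points.

```python
from itertools import product


def quasi_constants(points, d):
    """Estimate (C1, C2, epsilon) of a distance function d on a finite sample."""
    c1 = 1.0
    c2 = 1.0
    # Quasi-symmetry: d(a, b) <= C1 * d(b, a)
    for a, b in product(points, repeat=2):
        if d(b, a) > 0:
            c1 = max(c1, d(a, b) / d(b, a))
    # Quasi-triangulation inequality: d(a, c) <= C2 * (d(a, b) + d(b, c))
    for a, b, c in product(points, repeat=3):
        detour = d(a, b) + d(b, c)
        if detour > 0:
            c2 = max(c2, d(a, c) / detour)
    epsilon = 1.0 / (c2 * (c1 + 1.0))
    return c1, c2, epsilon
```

For a real metric such as `lambda a, b: abs(a - b)` on integers, the sketch returns
$C_1 = C_2 = 1$ and $\varepsilon = 1/2$, in line with the metric case stated above.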

The coverage property reveals a fractal dimension ($d = \log_2 l$ according to
Mandelbrot’s theory [12]) of the metric space. Intuitively we are familiar with
this idea, since a hypercube in $\mathbb{R}^d$ (i.e. a ball under the $L_\infty$-metric)
with edge length $s$ can be covered with $2^d$ sub-cubes of edge length $s/2$. Besides this
approach, space-filling (fractal) curves have already been used for the refined stretching
of $\mathbb{R}^d$ [11], [1].

Lemma 1. The four properties are independent. 



Definition 1. Let $S \subset M$ be finite. For each $p \in S$ the set

$$V_S^\circ(p) := \{x \in M \mid d(p,x) = 0 \,\vee\, \forall s \in S \setminus \{p\}:\ d(p,x) < d(s,x)\}$$

$$V_S(p) := \{x \in M \mid \forall s \in S:\ d(p,x) \leq d(s,x)\}$$

is called the open (closed) Voronoi-region of $p$ with respect to $S$. For $a, b \in M$
the set

$$H^\circ(a,b) := \{x \in M \mid d(a,x) = 0 \,\vee\, d(a,x) < d(b,x)\} = V^\circ_{\{a,b\}}(a)$$

$$H(a,b) := \{x \in M \mid d(a,x) \leq d(b,x)\} = V_{\{a,b\}}(a)$$

is called the open (closed) half-space of $a$ with respect to $\{a, b\}$. For $a, b \in M$
the bisector of $a$ and $b$ is defined:

$$B(a,b) := H(a,b) \cap H(b,a)$$

A bisector $B(a,b)$ separates two sets $A, B \subset M$ iff $A \subset H(a,b)$ and
$B \subset H(b,a)$. The set of all closed Voronoi-regions

$$VD(S) := \{V_S(p) \mid p \in S\}$$

is called the Voronoi-diagram of $S$. Figure 1 gives an example.

Figure 1. Voronoi-diagram of 6 points in the Euclidean plane



Lemma 2. Voronoi-regions are invariant with respect to a strictly monotonic
transformation (i.e. a transformation of $d$ to $d_\Phi := \Phi \circ d$, using a strictly
monotonic mapping $\Phi: \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$, $\Phi(0) = 0$) of the
distance measure $d$.

Lemma 2 proves the Voronoi-partitioning of a set to be invariant under a
class of distance measures. Additionally, we have to examine the impact of a
monotonic transformation on the topological constants ($C_1$, $C_2$ of Section 2).

Lemma 3. Applying a strictly monotonically increasing transformation
$d_\Phi := \Phi \circ d$ on $d$ will not worsen (heighten) $C_1, C_2$ if

$$\frac{\Phi}{\mathrm{id}}: \mathbb{R}_{>0} \to \mathbb{R}_{>0},\quad x \mapsto \frac{\Phi(x)}{x}$$

is monotonically decreasing, whereby $l$ cannot improve (lessen). Conversely, $l$
will not worsen and $C_1, C_2$ will not improve if $\frac{\Phi}{\mathrm{id}}$ is
monotonically increasing.



Besides proving robustness with respect to skewed data, the last two lemmata
demonstrate how a carefully chosen, neighborhood-respecting distortion
of the distance measure (for instance an acceleration) can be used to improve the
properties of the measure. This might be of interest while experimenting with
different distance functions.

Definition 2. Let a cluster be a tuple $(S, z)$, defined by the cluster center $z \in M$
and the finite base set $S \subset M$. For each $(S, z)$ let

$$r(S, z) := \max\{d(z, s) \mid s \in S\}$$

denote the radius.

With respect to the distance measure $d$, a center acts as a representative of a
cluster and the radius quantifies this relationship.

3 An Index Based on a Hierarchical Clustering Method

At first, we point to a simple observation: Let the distance measure $d$ be any
$L_p$-metric. Imagine a finite set $S$ of points scattered in the one-dimensional interval
$[-1, 1]$ and let $0, 1 \in S$. Taking 0 as the cluster center we gain a cluster
$(S, 0)$ with radius $r(S, 0) = 1$. Next, we partition $S$ by means of the bisector of 0
and 1, yielding two sub-clusters $(S_0, 0)$ and $(S_1, 1)$ with radii $r_0 = 1$ and
$r_1 = 0.5$, where $S_0 = \{s \in S \mid d(0, s) \leq d(1, s)\}$ and
$S_1 = \{s \in S \mid d(0, s) > d(1, s)\}$. In the consecutive step $S_0$ is split by the
bisector of its center and its farthest point in $S_0$, which results
in radii $r_{00}, r_{01} \leq 0.5$. Applying this simple partitioning method recursively in
a depth-first-search manner provides a reduction of the actual cluster radius by
factor 0.5 after at most two consecutive separation steps. After mapping $[0, 1]$
into the plane (for instance by $x \mapsto (x, \sin x)$) this method results in the same
reduction rate. But, after scattering the points of $S$ arbitrarily in the plane (making
full use of the two dimensions), about 8 consecutive steps are necessary
(depending on the $p$ used) to halve the radii. Mapping the points into a fractal
of dimension $1 < d < 2$ (for instance by $x \mapsto (x, \sin 1/x)$) results in a value
between both thresholds. In order to examine this observation, we reintroduce
a data structure which is directly implied by the mechanism mentioned above.

Definition 3. Let $(S, s_0)$ be a cluster where $S \subset M$ and $s_0 \in S$.
Let $\alpha: \mathcal{P}(S) \times S \to \{0, 1\}$ be a so-called truncation function,
equipped with the following properties:

$$\forall s \in S\ \forall S'' \subseteq S' \subseteq S:\quad |S'| < 2 \Rightarrow \alpha(S', s) = 1 \quad\text{and}\quad \alpha(S', s) = 1 \Rightarrow \alpha(S'', s) = 1.$$

A Monotonic-Bisector-Tree $MBT(S, s_0)$ is a rooted binary tree having the following
features:

1. If $\alpha(S, s_0) = 1$, then $MBT(S, s_0)$ consists of a single node containing the
   elements of $S$.

2. If $\alpha(S, s_0) = 0$, then $MBT(S, s_0)$ is a binary tree with root $w$. $w$ contains
   $s_0$ and $s_1 \in S \setminus \{s_0\}$. The sons of $w$ are the sub-roots $w_0$ and
   $w_1$, corresponding to the two subtrees $MBT(S_0, s_0)$ and $MBT(S_1, s_1)$, where for
   $i \in \{0, 1\}$ holds:

$$S_i \cap S_{1-i} = \emptyset \ \wedge\ S \cap H^\circ(s_i, s_{1-i}) \subseteq S_i \subseteq S \cap H(s_i, s_{1-i})$$

In this context, the radius of an $MBT(S, s)$ denotes the radius of the corresponding
cluster $(S, s)$. The radius of a node denotes the radius of the corresponding
subtree. The clusters given by the leaf-nodes of the tree (buckets) are called terminal
clusters.

The truncation function $\alpha$ determines the fineness of the decomposition of
$(S, s_0)$ based on $|S|$ and $r(S, s)$. We assume in the following that $\alpha(S, s)$ can
be computed in $O(|S|)$ time and storage. We call a Monotonic Bisector Tree completely
developed, iff $\alpha(S, s) = 1 \Leftrightarrow |S| < 2$ holds.

Obviously, this binary tree structure stores $n + 1$ different points in $n$ nodes.
Thus, the storage needed is linear in $|S|$. Figure 2 gives an example. Please note
that the redundancy implied by the inheritance of cluster centers is just of a logical
nature and can be physically avoided by storing only the new center in a node
and using a simple stack operation to retrieve the cluster center of an ancestor
node.

Figure 2. Monotonic Bisector Tree in the Euclidean plane



Lemma 4. MBTs do not have eccentric sons. The radii of the elements stored
in nodes which appear on a path from the root to a leaf generate a monotonically
decreasing sequence (justifying the term: monotonic tree).

Up to now, the initial cluster center of an MBT was taken for granted. This
is not the general case. The trivial determination of the best center (the one with the
smallest radius) requires $O(|S|^2)$ distance computations. This might be unbearable
in practice. The Computational Geometry community has brought forth faster
algorithms which work in special spaces [15], [7]. On the other hand, however,
the deviation caused by a randomly chosen initial center is bounded, since the
triangulation inequality holds:

$$d(x,y) \leq C_2\,(d(x, z_{opt}) + d(z_{opt}, y)) \leq C_2\,(C_1\, d(z_{opt}, x) + d(z_{opt}, y)) \leq C_2\,(C_1 + 1)\, r_{opt} = \varepsilon^{-1}\, r_{opt}$$

where $x, y \in S$ and $z_{opt}$ denotes the best initial center with radius $r_{opt}$.

The index build-up algorithm follows directly from the motivation of this
section:

1. Start with a cluster.

2. Use the cluster element which defines the current cluster radius (i.e. the element
   farthest from the center) as the alternative cluster center.

3. Split the current cluster by means of the bisector of the two centers and
   assign the resulting sub-clusters to two subtrees.

4. Recursively apply the partitioning method to the subtrees, until the threshold
   value of the truncation function is satisfied.
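
The four steps above can be sketched as a short recursive Python function. This is an
illustrative reading of the build-up algorithm, not the authors' code: the truncation
function is reduced to a hypothetical `leaf_size` threshold, nodes are plain dictionaries,
and points with equal distance to both centers are assigned to the old center.

```python
def build_mbt(points, center, d, leaf_size=1):
    """Build a Monotonic Bisector Tree for the cluster (points, center)."""
    radius = max((d(center, p) for p in points), default=0.0)
    node = {"center": center, "radius": radius,
            "points": list(points), "children": []}
    # Truncation: small clusters (or clusters of radius 0) become leaves.
    if len(points) <= leaf_size or radius == 0:
        return node
    # Step 2: the farthest element defines the radius and becomes the new center.
    far = max(points, key=lambda p: d(center, p))
    # Step 3: split by the bisector of the old and the new center.
    stay = [p for p in points if d(center, p) <= d(far, p)]
    move = [p for p in points if d(center, p) > d(far, p)]
    # Step 4: recurse; the old center is inherited by the first subtree.
    node["children"] = [build_mbt(stay, center, d, leaf_size),
                        build_mbt(move, far, d, leaf_size)]
    return node
```

A call such as `build_mbt(list(range(17)), 0, lambda a, b: abs(a - b))` produces a tree in
which no child has a larger radius than its parent, which is the monotonicity property
stated in Lemma 4.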

If the number of cluster centers that can be computed is bounded
(c-means-clustering [4]), we vary the build-up algorithm
and store the current clusters not in a stack, but in a max-heap. Thus, the
partitioning step recursively selects the leaf cluster with the biggest radius (maximum
error method).
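
A sketch of this maximum error variant, using Python's `heapq` module (a min-heap, hence
the negated radii), might look as follows. The function names and the parameter
`max_clusters` are assumptions made for the illustration.

```python
import heapq


def cluster_radius(points, center, d):
    """Radius of the cluster (points, center)."""
    return max(d(center, p) for p in points)


def split(points, center, d):
    """Split a cluster by the bisector of its center and its farthest point."""
    far = max(points, key=lambda p: d(center, p))
    stay = [p for p in points if d(center, p) <= d(far, p)]
    move = [p for p in points if d(center, p) > d(far, p)]
    return (stay, center), (move, far)


def bounded_clustering(points, center, d, max_clusters):
    """Always refine the leaf cluster with the biggest radius."""
    # Heap entries: (negated radius, insertion counter, points, center).
    heap = [(-cluster_radius(points, center, d), 0, points, center)]
    counter = 1
    while len(heap) < max_clusters:
        neg_radius, order, pts, c = heapq.heappop(heap)
        if neg_radius == 0:  # nothing left to refine
            heapq.heappush(heap, (neg_radius, order, pts, c))
            break
        for sub_points, sub_center in split(pts, c, d):
            entry = (-cluster_radius(sub_points, sub_center, d),
                     counter, sub_points, sub_center)
            heapq.heappush(heap, entry)
            counter += 1
    return [(c, pts, -neg_radius) for neg_radius, _, pts, c in heap]
```

The insertion counter only serves as a tie-breaker, so that heap entries with equal radii
never have to compare their point lists.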

4 Discovering the Fractal Dimension 

Theorem 1 For each $0 < q < 1$ there is an $l := l(q) \in \mathbb{N}$, such that the
build-up algorithm creates on every cluster $(S, s)$ with $S \subset M$ and center
$s \in S$ an MBT with the following feature:

For every node $v$ of height $h$ it holds: $r(v) \leq q^{\lfloor h/l \rfloor}\, r(S, s)$.

Thereby the required storage is $O(|S|)$.

In summary we can say that the simple build-up algorithm forces the cluster
radii of an MBT which appear on a path from the root down to a leaf to generate
a geometrically decreasing sequence. Due to the fact that this holds for any host
space, the rate of decrease depends on the covering feature of the host space
with lowest dimension, the natural dimension of the data set itself. This is in
general a fractal dimension. Now, after observing the reduction rate
$l(\varepsilon^2 q)$, we can compute this dimension:

$$d = \log_{1/(\varepsilon^2 q)} \left( l(\varepsilon^2 q) + 1 \right)$$

Now, the existence of hidden rules is proved if there is a discrepancy between the
dimension of the host space (data cube) and the calculated dimension. Although
this statement is the basic intention of the paper, we additionally state some
features which ease the handling of an MBT. The next theorem examines the
resolution of an MBT.

Theorem 2 Let $(S, s)$ be the initial cluster. If the truncation function
$\alpha: \mathcal{P}(S) \times S \to \{0, 1\}$ satisfies

$$\frac{r(S, s)}{r(S', s')} \geq C > 1 \ \Rightarrow\ \alpha(S', s') = 1 \quad \text{(Terminate Cluster)}$$

the height of the resulting tree can be estimated $h \in O(\log C)$.



The next theorem proves the asymptotically optimal performance of the build-up
algorithms of an MBT with respect to its height:

Theorem 3 The build-up algorithms need $O((1 + h)\,|S|)$ time to create an MBT
on a cluster $(S, s)$ with height $h$.

Because the height of an MBT is bounded by its cardinality $|S|$, we derive from
theorem 3 and theorem 2:

Theorem 4 Let $C > 1$ be given. The build-up algorithms need
$O(\min\{\log C + 1, |S|\}\,|S|)$ time for the creation of an MBT on $(S, s)$ such that
$\frac{r(S, s)}{r(b)} \geq C$ holds for any leaf $b$.

In order to ease the experimental handling of different distance measures we state
the last theorem, which is directly implied by lemma 2.

Theorem 5 MBTs created by the build-up algorithms are invariant with respect
to a monotonic transformation of the distance measure.

5 User Interface for Visual Data Mining 

Based on the concept of coarsening and refinement, we suggest the visualization of
an MBT for interactive visual data mining. Visualization is based on the development
of an MBT with the help of the build-up algorithm until a threshold value
for the number of leaf nodes is reached. This can be controlled by a suitable
truncation function $\alpha$, which the user can adjust.

Next, the resulting tree is mapped onto the screen: The cluster center of the
root is associated with the middle of the window, the origin. The sub-clusters
are arranged as discs in circular order around the origin. There are three degrees
of freedom to be assigned: the size of a disc, its distance from the origin and the
position of a disc in the circular ordering around the origin. We suggest associating
the volume (fatness) of a disc with its significance (for instance the cardinality of
the corresponding cluster) and ordering the discs around the origin with respect
to the corresponding cluster radius. The distance of a disc from the origin may
reflect its distance from the center of the root. In this context, the distance of two
clusters may be defined by the distance of the centers or the min/max-distance
of the corresponding sets. Of course, the assignment of quantitative tree features
to the three degrees of freedom is left to the user.
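
One possible assignment of the three degrees of freedom can be sketched as follows. The
sketch is purely illustrative: the dictionary keys, the function name and the choice of
evenly spaced angles are assumptions, and no particular GUI toolkit is involved.

```python
import math


def layout_discs(root_center, clusters, d):
    """Map clusters to discs around the origin of the window.

    Each cluster is a dict with the keys "center", "radius" and "points".
    Circular order follows the cluster radius, the distance from the origin
    follows the distance to the root center, the size follows the cardinality.
    """
    ordered = sorted(clusters, key=lambda c: c["radius"])
    discs = []
    for k, cluster in enumerate(ordered):
        angle = 2 * math.pi * k / len(ordered)
        distance = d(root_center, cluster["center"])
        discs.append({
            "x": distance * math.cos(angle),
            "y": distance * math.sin(angle),
            "size": len(cluster["points"]),
        })
    return discs
```

Other assignments, for example using the min/max-distance of the corresponding sets
instead of the distance of the centers, only require a different `d` or sort key.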

The interaction is carried out by a generalization of the finger concept [21] as
it is known from the navigation in NF²-tables. For meaningful illustrations of
NF²-tables, compare [8] and [10]. A finger is a cursor for navigating in complex
(structured) objects. Instead of a linear cursor, it offers three dimensions for
incremental positioning: zoom in/out, next/last item of a list, next/last attribute
within a record. Here, however, the hierarchy is of a quantitative nature instead
of the qualitative nature of NF²-tables. Mouse clicks and double-clicks on the discs
can be used to roll-up/drill-down or zoom/unzoom the corresponding cluster.
The circular order of the discs and the fatness of the discs can be orthogonally
used to activate a succeeding cluster by pressing an arrow key.

6 Summary and Outlook

In order to structure a high-dimensional data set, we have decided to index our
space by means of a metric. Each individual view on the data must be expressed
in terms of a metric. For each metric an index is created which materializes the
metric view on the data set. We have shown that this index can be used to
discover the real dimension of a large data set with respect to the chosen metric.
We have proved that the mechanism is fault tolerant with respect to the metric
and to the data. We made a proposal to apply the finger technique to this index
in order to support visual data mining. We demonstrated that this approach
is an alternative to plane projections, because the navigator can make full use
of the real dimension of a data set and, by using polar coordinates, can choose
a third degree of freedom. The main advantage is that the navigation is
sensitive to the given metric.

We have already implemented the finger concept on NF²-tables by using
Tcl/Tk [17]. Now, we plan to expand this user interface to handle metric navigation
as described in the last section.



Abstract. Database systems offering a multidimensional schema on a logical
level (e.g. OLAP systems) are often used in data warehouse environments. The
user requirements in these dynamic application areas are subject to frequent
changes. This implies frequent structural changes of the database schema. In
this paper, we present a formal framework to describe evolutions of multidimensional
schemas and their effects on the schema and on the instances. The
framework is based on a formal conceptual description of a multidimensional
schema and a corresponding schema evolution algebra. Thus, the approach is
independent of the actual implementation (e.g. MOLAP or ROLAP). We also
describe how the algebra enables a tool-supported environment for schema
evolution.



1 Introduction 

The main idea of a data warehouse architecture is the replication of large amounts of 
data gathered from different heterogeneous sources throughout an enterprise. This data 
is used by knowledge workers to drive their daily decisions. Consequently, easy-to-use 
interactive analysis facilities on top of the data warehouse are necessary. Most often 
multidimensional databases (specifically OLAP systems) are used for this purpose. 
These multidimensional information systems (MDIS) provide the user with a multi- 
dimensional view on the data and offer interactive multidimensional operations (e.g. 
slicing). 

The user of an MDIS interactively formulates queries based on the structure of the
multidimensional space (the MD schema of the database). This means that the schema
of the MD database determines what types of queries the user can ask. Thus, the
design of the schema in such an environment is a very important task. This has been
recognized by the research community, as several publications in the field of
multidimensional schema design show (e.g. [7], [12]). Nevertheless, a complete methodology
for designing and maintaining an MDIS must also take schema evolution into account,
which has so far received almost no attention. This paper provides a framework to
formally approach the evolution issue for MDIS and shows how this formal framework
can be used to implement a tool-supported evolution process. In such an environment,
the designer can specify the required schema evolution on a conceptual level
and the corresponding implementation is adapted automatically.

To understand why schema evolution plays an important role especially in decision 
support environments (where data warehouse and OLAP applications are mostly 
found), let us first take a look at the typical design and maintenance process of such a 
system. Interactive data analysis applications are normally developed using an itera- 
tive approach. The main two reasons for this very dynamic behavior are: 

• the interactive multidimensional analysis technology is new to the knowledge 
worker. This means that it is impossible for him to state his requirements in ad- 
vance. 

• the business processes in which the analyst is involved are subject to frequent 
changes. These changes in business processes are reflected in the analysis require- 
ments [15]. New types of queries that require different data become necessary. Be- 
cause the schema of an MDIS restricts the possible analysis capabilities, the new 
query requirements lead to changes in the MD database schema. 

A single iteration of the design and maintenance cycle [17] consists of the phases
‘Requirement Analysis’ (where the requirements of the users concerning data scope,
granularity, structure and quality are collected), ‘Conceptual Design’ (where the
required views of the users are consolidated into a single conceptual model and -
during further iterations of the development cycle - the schema is adapted according
to the changed requirements), ‘Physical (Technical) Design’ (where implementation
decisions are taken), ‘Implementation’ (a rather mechanical realization of the
specifications developed during the technical design phase), and ‘Operation’
(where new data is loaded to the database on a regular basis and the users analyze
data). Typically, when a system is in operation, new requirements for different or
differently structured data arise. Once a certain number of new requirements has
accumulated, a new iteration is started.

The conceptual multidimensional data model is the central part of the design and 
maintenance cycle as it already contains a consolidation of all user requirements but 
does not yet contain implementation details. All data models that occur later in the 
design process are refinements of the conceptual model. 

Thus, the starting point for our research of schema evolution operations and their 
effects is the conceptual level. The goal of our approach is to automatically propagate 
changes of the conceptual model to the other models along the design cycle. A prereq- 
uisite for this is a formal framework to describe the evolution operations and their 
effects, which we present in this paper. 

The rest of the paper is structured as follows: In section 2 we discuss related work 
from the areas of data warehousing and object-oriented databases. Section 3 summa- 
rizes the objectives and benefits of our framework for multidimensional schema evo- 
lution. Section 4 develops a formal notion of multidimensional schemas and instances 
which serves as a basis for the description of schema evolution operations and their 
effects that is described in section 5. Section 6 sketches how the formal framework 
can be used to implement an interactive tool-supported schema evolution process. We 
conclude with directions for future work in section 7.

2 Related Work

A first approach to changing user requirements and their effects on the
multidimensional schema is [11]. Kimball introduces the concept of “slowly changing
dimensions”, which encompasses so-called structural changes, i.e. value changes of
dimension attributes, like changing the address of a customer. Since data in an OLAP system
is always time related, the change history has to be reflected. If a customer moves, 
both the old and the new address have to be stored. Solutions for this case (data evolu- 
tion) are rather straightforward. The slowly changing dimensions approach is not 
complete and provides a rather informal basis for data and schema changes. Further, 
there are no clear decision criteria for the proposed implementation alternatives. 

Golfarelli et al. [7], [8] proposed a methodological framework for data warehouse 
design based on a conceptual model called dimensional fact (DF) scheme. They intro- 
duce a graphical notation and a methodology to derive a DF model from E/R models 
of the data sources. Although the modeling technique supports semantically rich con- 
cepts it is not based on a formal data model. Furthermore, the framework does not 
concentrate on evolution issues which we believe is an important feature for the design 
and maintenance cycle. 

Schema evolution has been thoroughly investigated in the area of object-oriented 
database systems (OODBMS) because - similar to the multidimensional case - there 
are conceptual relationships representing semantical information: the isa relationships 
representing inheritance between classes. Schema evolution in object-oriented data- 
base systems has been broadly discussed both in research prototypes and commercial 
products (e.g. [1], [20], [10]). We take these approaches as a foundation for our work
and investigate how techniques and approaches from object-oriented schema evolution
can be adapted to the case of multidimensional information systems.

Most research work in the area of OLAP and data warehousing concentrates on 
view management issues. These approaches see the warehouse database as a materi- 
alized view over the operational sources. The arising problems are how these views 
can be maintained efficiently (view maintenance problem, see e.g. [6]), which
aggregations on which level improve performance with given space limitations (view
selection problem, see e.g. [3]), and how the views can be adapted when changes in the
view definition or view extent arise (view adaptation and synchronization problem,
see e.g. [14], [16]). Our work supplements these approaches because we develop the 
warehouse schema from the user requirements and not from the schemas of the opera- 
tional sources. 

A recent approach to schema evolution is [9]. From the related work mentioned 
above this approach is the most closely related to our work. However, it differs in the 
following aspects. First, it only addresses changes in the dimensions. We provide a set 
of evolution operations covering also facts and attributes. Next, insertions of levels are 
limited to certain positions in a dimension. Our framework allows arbitrary insertions
of dimension levels at any place of a given MD schema. Further, our approach is
based on a conceptual level, thus not assuming any specific implementation details 
(e.g. a ROLAP implementation).

3 An Overview of the Schema Evolution Framework for MDIS

The objective of the work described in this paper is to propose a methodology that 
supports an automatic adaptation of the multidimensional schema and the instances, 
independent of a given implementation. We provide a conceptual multidimensional 
schema design methodology including a schema evolution algebra. Our vision of the 
warehouse design and maintenance cycle is that the whole system is specified and 
designed on a conceptual level. Changes that arise when the system is already in
production would only be specified on the conceptual level. Our design and evolution
environment takes care of the necessary changes in the specific target system (i.e.
database and query/management tools).

To this end, our general framework comprises 

1. a data model (i.e. a formal description of multidimensional schemas and instances),

2. a set of formal evolution operations, 

3. the descriptions of effects (extending to schema and instances) of the operations, 

4. an execution model for evolution operations, and finally 

5. a methodology for using our framework.

This paper addresses points one, two and three. It further contains ideas for the
methodology (point five). The main objectives of our framework for multidimensional
schema design and evolution (see [2] for the complete list) are

• automatic adaptation of instances: existing instances should be adapted to the new
schema automatically. Further, the adaptation of instances should be possible
separately from the adaptation of the schema (in case there are no instances yet).
Physical and/or logical adaptation should be possible.

• support for atomic and complex operations: our methodology defines atomic evo- 
lution operations as well as complex operations. 

• clear definition of semantics of evolution operations: a given schema evolution
operation may admit several alternative semantics, so its effect is not always clear.
Our methodology fixes one alternative for execution.

• providing a mechanism for change notification (forward compatibility): we provide 
a change notification mechanism and guarantee that existing applications do not 
have to be adapted to the new schema. Further, there is no need for immediate ad- 
aptations of the tool configurations. 

• concurrent operation and atomicity of evolution operations: the framework should 
allow concurrency of schema changes and regular queries. Further, schema evolu- 
tion transactions shall be atomic. 

• different strategies for the scheduling of effects: the framework should offer lazy 
strategies for the execution of effects of a schema change. Based on a cost model, 
the system may schedule the execution of effects for a later point in time. If the 
adapted instances are needed immediately, the system notifies the user about
performance problems that may arise.

• support of the design and maintenance cycle: the framework supports all phases of
the design and maintenance cycle. Thus, we cover not only the initial design (where
the OLAP system is not populated with instances yet), but also allow adaptations of
a populated system.

Summing up, our approach shall be used as a basis for tool-supported warehouse
schema changes. The framework provides an easy-to-use tool that allows schema
modifications to be performed without detailed knowledge about the specific
implementation and tools. The schema designer does not have to adapt different
configurations of a tool and a database schema which must be consistent for a given
implementation. Instead, the tool is responsible for performing the necessary steps in
a consistent and semantically correct way, providing a single point of control
accessible via a graphical formalism.

The contributions and scope of this paper are a formal model for MD schemas and 
instances together with a set of evolution operations. We describe the effects of these 
evolution operations on the MD schema and the instances, thus providing a formal 
algebra for MD schema evolution. 



4 Multidimensional Schema and Instances 

Several interpretations of the multidimensional paradigm can be found both in the 
literature (e.g. [4], [5], [13], [21]) and in product implementations. A comparison of 
the formal approaches shows that most of them do not formally distinguish between 
schema and instances ([19]) as their main goal is a formal treatment of queries using 
algebras and calculi. For our research work, we need a formalism that can serve as a 
basis for defining the schema evolution operations and their effects (see section 5). 
Therefore, this section contains a formal definition of a multidimensional schema and 
its instances (which was inspired by the formal multidimensional models mentioned 
above, esp. [4], [5], [21]). 

The schema (or MD model) of an MDIS contains the structure of the facts (with
their attributes) and their dimension levels (with their attributes) including different
classification paths [19]. We assume a finite alphabet $\Sigma$ and denote the set of all
finite sequences over $\Sigma$ as $\Sigma^*$.

Definition 4.1 (MD model, MD schema): An MD model $\mathcal{M}$ is a 6-tuple
$\langle F, L, A, gran, class, attr \rangle$ where

• $F \subset \Sigma^*$ is a finite set of fact names $\{f_1, \ldots, f_m\}$ where $f_i \in \Sigma^*$ for $1 \le i \le m$.

• $L \subset \Sigma^*$ is a finite set of dimension level names $\{l_1, \ldots, l_k\}$ where $l_i \in \Sigma^*$ for $1 \le i \le k$.

• $A \subset \Sigma^*$ is a finite set of attribute names $\{a_1, \ldots, a_p\}$ where $a_i \in \Sigma^*$ for $1 \le i \le p$.
Each attribute name $a$ has a domain $dom(a)$ attached.

• The names of facts, levels and attributes are all different, i.e. $L \cap F \cap A = \emptyset$.

• $gran: F \to 2^L$ is a function that associates a fact with a set of dimension level
names. These dimension levels $gran(f)$ are called the base levels of fact $f$.

• $class \subseteq L \times L$ is a relation defined on the level names. The transitive, reflexive
closure $class^*$ of $class$ must fulfill the following property for $l_i \ne l_j$:
$(l_i, l_j) \in class^* \Rightarrow (l_j, l_i) \notin class^*$. That means that $class^*$ defines a partial
order on $L$. $(l_1, l_2) \in class^*$ reads “$l_1$ can be classified according to $l_2$.”

• $attr: A \to F \cup L \cup \{\bot\}$ is a function mapping an attribute either to a fact (in this
case the attribute is called a measure), to a dimension level (in this case it is called
dimension level attribute) or to the special $\bot$-symbol which means that this attribute
is not connected at all.

The MD model formalizes the schema of a multidimensional database. We use this 
formalism in section 5 to define a set of schema evolution operations. As we also want 
to analyze the effects of schema evolution operations on the instances of the schema, 
the remainder of this section presents a formal model for instances. 

Definition 4.2 (Domain of a Dimension Level): The domain of a dimension level $l \in L$
is a finite set $dom(l) = \{m_1, \ldots, m_q\}$ of dimension member names.

Definition 4.3 (Domain and Co-domain of a fact): For a fact $f$ the domain $dom(f)$
and co-domain $codom(f)$ are defined as follows:

$$dom(f) := \mathop{\times}\limits_{l \in gran(f)} dom(l)$$

$$codom(f) := \mathop{\times}\limits_{\{a \mid attr(a) = f\}} dom(a)$$

Definition 4.4 (Instance of MD model): The instance of an MD model
$\mathcal{M} = \langle F, L, A, gran, class, attr \rangle$ is a triple
$\mathcal{I}_{\mathcal{M}} = \langle \text{R-UP}, C, AV \rangle$ where

• $\text{R-UP} = \{ r\text{-}up_{lev_1}^{lev_2} \}$ is a finite set of functions with
$r\text{-}up_{lev_1}^{lev_2}: dom(lev_1) \to dom(lev_2)$ for all $(lev_1, lev_2) \in class$.

• $C = \{c_{f_1}, \ldots, c_{f_m}\}$; $f_i \in F\ \forall\, 1 \le i \le m$ is a finite set of functions
$c_f: dom(f) \to codom(f)$; $f \in F$. $C$ maps coordinates of the cube to measures, thus
defining the contents of the data cube.

• $AV = \{av_1, \ldots, av_r\}$ is a finite set of functions which contains a function $av_a$ for
each attribute $a$ that is a dimension level attribute, i.e. $attr(a) \in L$. The function
$av_a: dom(attr(a)) \to dom(a)$ assigns an attribute value (for attribute $a$) to each member of
the corresponding level.
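
Definition 4.1 can be made concrete with a small data structure. The following Python sketch is only an illustration under our own assumptions: the class name `MDModel`, the field names, the use of `None` for the $\bot$ symbol, and the car repair example values are hypothetical and not part of the formal model. Name disjointness is checked pairwise, which is how we read "all different".

```python
from dataclasses import dataclass

BOTTOM = None  # stands in for the bottom symbol: attribute not connected


def transitive_closure(pairs):
    """Return the transitive closure of a set of (lower, upper) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


@dataclass(frozen=True)
class MDModel:
    facts: frozenset        # F
    levels: frozenset       # L
    attributes: frozenset   # A
    gran: dict              # fact name -> frozenset of base level names
    cls: frozenset          # classification pairs (l1, l2)
    attr: dict              # attribute name -> fact, level, or BOTTOM

    def is_valid(self):
        """Check the conditions listed in Definition 4.1."""
        names_distinct = (
            self.facts.isdisjoint(self.levels)
            and self.facts.isdisjoint(self.attributes)
            and self.levels.isdisjoint(self.attributes)
        )
        gran_ok = set(self.gran) == set(self.facts) and all(
            base <= self.levels for base in self.gran.values()
        )
        class_ok = all(a in self.levels and b in self.levels for a, b in self.cls)
        # The closure must not relate two distinct levels in both directions.
        closure = transitive_closure(self.cls)
        order_ok = all((b, a) not in closure for a, b in closure if a != b)
        targets = self.facts | self.levels | {BOTTOM}
        attr_ok = set(self.attr) == set(self.attributes) and all(
            target in targets for target in self.attr.values()
        )
        return names_distinct and gran_ok and class_ok and order_ok and attr_ok


# Hypothetical example: daily repair cases per brand.
repairs = MDModel(
    facts=frozenset({"repair"}),
    levels=frozenset({"day", "month", "brand"}),
    attributes=frozenset({"cases", "weekday"}),
    gran={"repair": frozenset({"day", "brand"})},
    cls=frozenset({("day", "month")}),
    attr={"cases": "repair", "weekday": "day"},
)
print(repairs.is_valid())
```

Here `cases` is a measure of the fact `repair`, while `weekday` is a dimension level attribute of `day`. A classification relation containing both `("day", "month")` and `("month", "day")` would be rejected by the partial order check.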



5 Evolution Operations 

After having formally defined the notion of a multidimensional schema and its
instances, we present a set of formal evolution operations for MD models. For each
operation, we introduce the operation with its parameters and describe the effect on
the MD schema. Since the operations usually do not work on a database without
instances, the modification of existing instances is also given.

Together with the exact definition of an MD model and an instance of such an MD
model (in section 4), the evolution operations listed here provide a formal schema
evolution algebra. A formal property of this schema evolution algebra is its closure:
an algebra or language is closed if the result of any allowable operation is a legal
construct in the language. The closure of our schema evolution algebra can be
formally proved because the algebra recognizes only one underlying construct, an MD
model. Every operation is defined to take an MD model as input argument and
produces as output a new MD model. Therefore, by definition, the algebra is closed.

We provide a minimal set of atomic operations. Atomic refers to the property that 
every operation only ‘tackles’ (i.e. changes) exactly one set or function of the given 
MD model. Of course, for notational convenience, complex operations can be de- 
fined. These complex operations consist then of a sequence of atomic operations. A 
formal proof of the minimality and completeness of this set of operations is ongoing 
work. 
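
The relation between atomic and complex operations can be sketched as function composition. The Python fragment below is a minimal illustration, assuming a simplified model that only holds levels and classification pairs. The function names and the complex operation `add_month_above_day` are hypothetical. Every atomic step takes a model and returns a new model, so any sequence of steps again yields a model.

```python
from functools import reduce


def insert_level(model, level):
    """Atomic operation: add an isolated dimension level."""
    if level in model["levels"]:
        raise ValueError(f"level {level!r} already exists")
    return {**model, "levels": model["levels"] | {level}}


def insert_classification(model, lower, upper):
    """Atomic operation: add the classification pair (lower, upper)."""
    if lower not in model["levels"] or upper not in model["levels"]:
        raise ValueError("both levels must already exist")
    return {**model, "class": model["class"] | {(lower, upper)}}


def compose(*steps):
    """Build a complex operation from a sequence of atomic operations."""
    return lambda model: reduce(lambda m, step: step(m), steps, model)


model = {"levels": frozenset({"day"}), "class": frozenset()}

# Hypothetical complex operation: add a level 'month' above 'day'.
add_month_above_day = compose(
    lambda m: insert_level(m, "month"),
    lambda m: insert_classification(m, "day", "month"),
)
evolved = add_month_above_day(model)
print(sorted(evolved["levels"]), sorted(evolved["class"]))
```

Each atomic function changes exactly one component of the model and leaves the input untouched, which mirrors the atomicity property stated above.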

Formally, a schema evolution operation op transforms an MD model
$\mathcal{M} = \langle F, L, A, gran, class, attr \rangle$ to an MD model $\mathcal{M}'$. Some operations also
require an adaptation of the instances $\mathcal{I}_{\mathcal{M}}$ to $\mathcal{I}'_{\mathcal{M}'}$. We always denote elements
before the evolution with the regular letter (e.g. $F$), whereas a letter with an
apostrophe (e.g. $F'$) denotes the corresponding element after the evolution. For a
function $f: dom \to codom$ let $f|_{dom'}$ denote the restriction of $f$ to $dom' \subseteq dom$.

1. insert level: this operation extends an existing MD model by a new dimension
level. The operation extends the set of levels without changing the classification
relationships, thus creating an isolated element. Classification relationships have
to be defined separately.

Parameter: new level name $l_{new} \notin L$.

$\mathcal{M}' = \langle F, L' := L \cup \{l_{new}\}, A, gran', class', attr' \rangle$
$gran': F \to 2^{L'}$; $gran'(f) := gran(f)$
$class' \subseteq L' \times L'$; $(l_1, l_2) \in class' :\Leftrightarrow (l_1, l_2) \in class$
$attr': A \to F \cup L' \cup \{\bot\}$; $attr'(a) := attr(a)$

No effects on instances because informally the operation introduces a new and
therefore empty dimension level. $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV \rangle$

2. delete level: deletes a dimension level $l_{del}$ from an MD model. The level must not
be connected to a fact ($l_{del} \notin gran(f)\ \forall f \in F$) or via classification relationships
($(l_{del}, l) \notin class \wedge (l, l_{del}) \notin class\ \forall l \in L$). Further, the level must not have any
attributes attached ($attr(a) \ne l_{del}\ \forall a \in A$). Instances are deleted automatically
together with the dimension level. Parameter: level name $l_{del} \in L$.

$\mathcal{M}' = \langle F, L' := L \setminus \{l_{del}\}, A, gran', class', attr' \rangle$
$gran': F \to 2^{L'}$; $gran'(f) := gran(f)$
$class' \subseteq L' \times L'$; $(l_1, l_2) \in class' :\Leftrightarrow (l_1, l_2) \in class\ \forall l_1, l_2 \in L'$
$attr': A \to F \cup L' \cup \{\bot\}$; $attr'(a) := attr(a)$

Instances: no effect because dimension members are deleted automatically.
$\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV \rangle$

3. insert attribute: creates a new attribute without attaching it to a dimension level
or fact. Assigning an existing attribute to a dimension level or fact is a separate
operation (connect attribute). Parameter: attribute name $a_{new} \notin A$ with $dom(a_{new})$.

$\mathcal{M}' = \langle F, L, A' := A \cup \{a_{new}\}, gran, class, attr' \rangle$

$attr': A' \to F \cup L \cup \{\bot\}$; $attr'(a) := attr(a)$ for $a \in A$, and $attr'(a_{new}) := \bot$
because the new attribute is not connected yet.

Instances: no effect, $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV \rangle$

4. delete attribute: deletes an existing, but disconnected attribute (i.e. the attribute
is not attached to a dimension level or fact).

Parameter: attribute name $a_{del} \in A$

$\mathcal{M}' = \langle F, L, A' := A - \{a_{del}\}, gran, class, attr' \rangle$
$attr': A' \to F \cup L \cup \{\bot\}$; $attr'(a) := attr(a)$

Instances: no effect, $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV \rangle$

5. connect attribute to dimension level: connects an existing attribute $a_{new}$ to a
dimension level $l \in L$. Parameters: attribute name $a_{new} \in A$; dimension level $l \in L$;
a function $g$ for computing $a_{new}$. This function can also assign a default value.

$\mathcal{M}' = \langle F, L, A, gran, class, attr' \rangle$

$attr': A \to F \cup L \cup \{\bot\}$; $attr'(a) := \begin{cases} l & \text{if } a = a_{new} \\ attr(a) & \text{else} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV' \rangle$, R-UP not changed, C not changed.

$AV'$: define $av_{a_{new}}: dom(l) \to dom(a_{new})$, $AV' := AV \cup \{av_{a_{new}}\}$

$\forall m \in dom(l): av_{a_{new}}(m) := g(m)$

6. disconnect attribute from dimension level: disconnects an attribute $a_{del}$ from a
dimension level $l \in L$. Parameters: attribute name $a_{del} \in A$; dimension level $l \in L$.

$\mathcal{M}' = \langle F, L, A, gran, class, attr' \rangle$

$attr': A \to F \cup L \cup \{\bot\}$; $attr'(a) := \begin{cases} \bot & \text{if } a = a_{del} \\ attr(a) & \text{else} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C, AV' \rangle$, R-UP not changed, C not changed.

$AV'$: let $av_{a_{del}}$ be the corresponding attribute value function for $a_{del}$,
$AV' := AV - \{av_{a_{del}}\}$

7. connect attribute to fact: connects an existing attribute $a_{new}$ to a fact $f \in F$.
Parameters: attribute name $a_{new} \in A$; fact $f \in F$; a function $g$ for computing $a_{new}$.

$\mathcal{M}' = \langle F, L, A, gran, class, attr' \rangle$

$attr': A \to F \cup L \cup \{\bot\}$; $attr'(a) := \begin{cases} f & \text{if } a = a_{new} \\ attr(a) & \text{else} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.

$C' := C - \{c_f\} \cup \{c'_f\}$; $c'_f: dom(f) \to codom(f)$ with

$c'_f(x) := (z_1, \ldots, z_n, z_{n+1})$ with $(z_1, \ldots, z_n) = c_f(x)$ and $z_{n+1} = g(x)$

8. disconnect attribute from fact: disconnects an existing attribute $a_{del}$ from a fact
$f \in F$. Parameters: attribute name $a_{del} \in A$; fact $f \in F$.

$\mathcal{M}' = \langle F, L, A, gran, class, attr' \rangle$

$attr': A \to F \cup L \cup \{\bot\}$; $attr'(a) := \begin{cases} \bot & \text{if } a = a_{del} \\ attr(a) & \text{else} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.
$C' := C - \{c_f\} \cup \{c'_f\}$; $c'_f: dom(f) \to codom(f)$ with

$c'_f(x) := (z_1, \ldots, z_{n-1})$ with $(z_1, \ldots, z_{n-1}, z_n) = c_f(x)$

9. insert classification relationship: this operation defines a classification
relationship between two existing dimension levels. Parameters: levels $l_1, l_2 \in L$
$\mathcal{M}' = \langle F, L, A, gran, class', attr \rangle$
$class' = class \cup \{(l_1, l_2)\}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}', C, AV \rangle$, C not changed, AV not changed.

$\text{R-UP}' := \text{R-UP} \cup \{ r\text{-}up_{l_1}^{l_2} \}$, $\forall m \in dom(l_1)$: $r\text{-}up_{l_1}^{l_2}(m) := k$, $k \in dom(l_2)$.

Additionally, $r\text{-}up_{l_1}^{l_2}(dom(l_1)) \subseteq dom(l_2)$, i.e. $r\text{-}up_{l_1}^{l_2}$ is well-defined
$\forall m \in dom(l_1)$.



10. delete classification relationship: removes a classification relationship without
deleting the corresponding dimension levels. Parameter: $(l_1, l_2) \in class$
$\mathcal{M}' = \langle F, L, A, gran, class', attr \rangle$
$class' = class - \{(l_1, l_2)\}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}', C, AV \rangle$, C not changed, AV not changed.

$\text{R-UP}' := \text{R-UP} - \{ r\text{-}up_{l_1}^{l_2} \}$



11. insert fact: this operation extends the MD model by a new fact. The operation
extends the set of facts without attaching dimension levels to this fact.
Dimensions for this fact have to be defined separately. Parameter: new fact $f_{new} \notin F$
$\mathcal{M}' = \langle F' := F \cup \{f_{new}\}, L, A, gran', class, attr' \rangle$

$gran': F' \to 2^L$; $gran'(f) := \begin{cases} \emptyset & \text{if } f = f_{new} \\ gran(f) & \text{else} \end{cases}$

$attr': A \to F' \cup L \cup \{\bot\}$; $attr'(a) := attr(a)$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.

$C' := C \cup \{c_{f_{new}}\}$, $c_{f_{new}}: dom(f_{new}) \to codom(f_{new})$, define $c_{f_{new}}(x) := \bot$
$\forall x \in dom(f_{new})$

12. delete fact: removes a fact $f_{del}$ from an MD model. The fact must not be
connected to a dimension ($gran(f_{del}) = \emptyset$) and must also not contain any attributes
($attr(a) \ne f_{del}\ \forall a \in A$). Parameter: name of fact to be deleted $f_{del} \in F$
$\mathcal{M}' = \langle F' := F - \{f_{del}\}, L, A, gran|_{F'}, class, attr' \rangle$
$attr': A \to F' \cup L \cup \{\bot\}$; $attr'(a) = attr(a)$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.
$C' := C - \{c_{f_{del}}\}$




M. Blaschka, C. Sapia, and G. Höfling



13. insert dimension into fact: inserts a dimension at a given dimension level into an
existing fact, thus increasing the number of dimensions by one. Parameters: level
name $l \in L$ and fact name $f_{ins} \in F$. Additionally, a function $nv$ is provided defining
how to compute the new values for the fact based upon the now extended set of
dimensions and the old value of the fact. Each cell of the old cube now becomes a
set of cells, exactly reflecting the new dimension. This means that each old value
of the fact is now related to all elements of the new dimension. For instance,
assume we have daily repair cases of cars stored without the brand (i.e. we have no
distinction between the brand of cars). Now we want to include the brand, meaning
that we insert a dimension at the level brand (cf. figure 1).

We have to provide a function that computes the new fact (repair cases by
brand) based on the old dimensions (without brand) and the (old) number of repair
cases. The old number of repair cases could be repair cases for a specific brand
(alternative 1 in figure 1), a summarization over all brands (alternative 2), or other.
The idea how the new values can be computed is stored in $nv$. For example, if we
only had BMW cars before, then we would use the old fact value for BMW and $\bot$
for all other cars (because the values cannot be computed, alternative 1). If the old
value was a sum over all brands, we could only take this value as a sum, whereas
values for the single brands are unknown (corresponding to “?” in figure 1).

Figure 1: Different alternatives for the instance adaptation

More formally, the situation before is that $c(x_1, \ldots, x_n) = z$. Now, after insertion of
the new dimension we have $c'(x_1, \ldots, x_n, x_{n+1}) = y\ \forall x_{n+1} \in dom(l)$ and thus

$nv: codom(f_{ins}) \times dom(l) \to codom(f_{ins})$, $nv(z, x_{n+1}) = y$, $x_{n+1} \in dom(l)$

Schema: $\mathcal{M}' = \langle F, L, A, gran', class, attr \rangle$

$gran': F \to 2^L$; $gran'(f) := \begin{cases} gran(f) & \text{for } f \ne f_{ins} \\ gran(f) \cup \{l\} & \text{for } f = f_{ins} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.

$c'_{f_{ins}}: dom(f_{ins}) \to codom(f_{ins})$ (with $dom$ taken with respect to $gran'$),

$c'_{f_{ins}}(x_1, \ldots, x_n, x_{n+1}) = nv(c_{f_{ins}}(x_1, \ldots, x_n), x_{n+1})$ with $x_{n+1} \in dom(l)$



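14. delete dimension: deletes a dimension (connected to the fact at level $l$) from a
fact. Parameters: level name $l \in L$ and fact name $f_{del} \in F$. Additionally, an
aggregation function $agg$ has to be provided which defines how the fact values have to
be aggregated over the deleted dimension (e.g. summation).
$\mathcal{M}' = \langle F, L, A, gran', class, attr \rangle$

$gran': F \to 2^L$; $gran'(f) := \begin{cases} gran(f) & \text{for } f \ne f_{del} \\ gran(f) - \{l\} & \text{for } f = f_{del} \end{cases}$

Instances: $\mathcal{I}'_{\mathcal{M}'} = \langle \text{R-UP}, C', AV \rangle$, R-UP not changed, AV not changed.
$c'_{f_{del}}: dom(f_{del}) \to codom(f_{del})$ (with $dom$ taken with respect to $gran'$),

$c'_{f_{del}}(x_1, \ldots, x_{n-1}) = agg_{x_n}(c_{f_{del}}(x_1, \ldots, x_n))$ with $x_n \in dom(l)$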
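
The instance adaptation of operations 13 and 14 can be illustrated on a tiny cube. The Python sketch below is only an illustration under simplifying assumptions: a cube is a dictionary from coordinate tuples to a single measure, the inserted or deleted dimension is always the last coordinate, and `None` stands in for $\bot$. The helper names, the day labels and the second brand "VW" are hypothetical.

```python
def insert_dimension(cube, members, nv):
    """Extend every cell by one coordinate; nv(z, member) yields the new value."""
    return {
        coords + (member,): nv(z, member)
        for coords, z in cube.items()
        for member in members
    }


def delete_dimension(cube, agg):
    """Drop the last coordinate and aggregate the values that collapse together."""
    groups = {}
    for coords, z in cube.items():
        groups.setdefault(coords[:-1], []).append(z)
    return {coords: agg(values) for coords, values in groups.items()}


UNKNOWN = None  # stands in for the bottom symbol

# Repair cases per day, stored without the brand.
cases = {("mon",): 5, ("tue",): 3}


# Alternative 1: all old values referred to BMW, other brands are unknown.
def nv_only_bmw(z, brand):
    return z if brand == "BMW" else UNKNOWN


by_brand = insert_dimension(cases, ["BMW", "VW"], nv_only_bmw)


# Summation over the deleted dimension, skipping unknown cells.
def sum_known(values):
    return sum(v for v in values if v is not UNKNOWN)


restored = delete_dimension(by_brand, sum_known)
print(by_brand)
print(restored)
```

With this particular choice of `nv` and `agg`, deleting the brand dimension again reproduces the original cube. Other choices of `nv`, such as alternative 2, would in general not be invertible in this way.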



6 Tool Support for the Evolution Process 

When a considerable number of requirements change for an MDIS that is operational,
a new iteration of the development process is initiated (see section 1). First, the new
and changed requirements are compiled and documented. The schema evolution
process begins when these changed requirements are to be incorporated into the
conceptual data model. Our vision is that the warehouse modeller works with the
conceptual schema in a design tool based on a graphical representation (e.g. using the
ME/R notation [18]). The tool enables the warehouse modeller to successively apply
evolution operations $op_1, \ldots, op_n$ to the schema. When the designer is satisfied with the
results, he commits his changes. At this time, the system checks the integrity of the
resulting model and propagates the changes of the schema and instances to the
implementation level, e.g. by transforming the evolution operations to a sequence of
SQL commands. Thus, the warehouse modeller does not need to have knowledge about
the specific implementation because the tool-supported environment allows him to work
purely on the conceptual level.
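
How a sequence of evolution operations might be turned into SQL commands can be sketched as follows. This Python fragment is purely illustrative and is not the mapping used by our tool: it assumes a hypothetical relational layout with one table per dimension level, a `member` key column, and a foreign key column named `<upper>_member` for each classification relationship. SQLite is used only so that the generated statements can be executed.

```python
import sqlite3


def to_sql(op):
    """Map one evolution operation (a tuple) to SQL statements.

    Assumed layout: one table per dimension level with a 'member' key.
    """
    kind = op[0]
    if kind == "insert_level":
        _, level = op
        return [f'CREATE TABLE "{level}" (member TEXT PRIMARY KEY)']
    if kind == "connect_attribute_to_level":
        _, attribute, level = op
        return [f'ALTER TABLE "{level}" ADD COLUMN "{attribute}" TEXT']
    if kind == "insert_classification":
        _, lower, upper = op
        return [
            f'ALTER TABLE "{lower}" ADD COLUMN "{upper}_member" TEXT '
            f'REFERENCES "{upper}"(member)'
        ]
    raise ValueError(f"unsupported operation: {kind}")


# A committed sequence of operations op_1, ..., op_n (hypothetical example).
operations = [
    ("insert_level", "day"),
    ("insert_level", "month"),
    ("connect_attribute_to_level", "weekday", "day"),
    ("insert_classification", "day", "month"),
]

conn = sqlite3.connect(":memory:")
for op in operations:
    for statement in to_sql(op):
        conn.execute(statement)

columns = [row[1] for row in conn.execute("PRAGMA table_info(day)")]
print(columns)
```

Only three of the fourteen operations are covered here. A real propagation component would also have to adapt the instances, for example by filling the new columns using the functions $g$ supplied with the operations.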



7 Conclusions and Future Work 

We suggested a framework for multidimensional schema evolution that enables the 
design of a tool-supported schema evolution environment on a conceptual level. In
this paper, we presented the core of this framework: a schema evolution algebra based 
on a formal description of multidimensional schema and instances. This algebra offers 
14 atomic evolution operations that can be used to build more complex evolution op- 
erations. Furthermore, we described how this formalism can be embedded into the 
iterative tool-based design and maintenance process of multidimensional information 
systems. The future research work of our group will investigate the automatic
propagation of effects to different implementations (e.g. multidimensional, relational)
taking into account effects on predefined aggregates, indexing schemes and predefined
reports.

Abstract. In OLAP models, or data cubes, aggregates have to be recalculated
when the underlying base data changes. This may cause performance problems 
in real-time OLAP systems, which continuously accommodate huge amounts of 
measurement data. To optimize the aggregate computations, a new consistency 
criterion called the tolerance invariant is proposed. Lazy aggregates are aggre- 
gates that are recalculated only when the tolerance invariant is violated, i.e., the 
error of the previously calculated aggregate exceeds the given tolerance. An in- 
dustrial case study is presented. The prototype implementation is described, to- 
gether with the performance results. 



1 Introduction 

Traditional OLAP (on-line analytical processing) [3] technology is developed mainly 
for commercial analysis needs. Data from various sources is compiled into multiat- 
tribute fact data tuples. They are used to compute multidimensional aggregates that 
make up the essence of OLAP. 

An industrial process is often a subject of extensive analysis, too. However, the 
traditional OLAP technology is insufficient for such analysis. It lacks support for
real-time operation and efficient time series analysis. Usually, the industrial analysis
model, comprising measurement data and derived data, must be up to date instantly
when the contents of the underlying data sources change. We claim that a certain level
of temporal consistency [8] must be maintained in an industrial analysis model.

In this paper, we present the requirements for Industrial OLAP (IOLAP) for
industrial process analysis. We analyze problems encountered in a case study, having to do
with handling time series data and maintaining consistency of the analysis structure in 
real-time. We introduce a method to decrease the amount of computation required to 
update the analysis model, by allowing the model to have some value-inaccuracy. We 
call this the lazy aggregation method. We present an example of a typical industrial 
application and summarize experimental results of applying the method.

A theoretical abstraction of the multidimensional analysis model, the data cube, was
refined in [4]. The basic computation models for the data cube were reviewed in [1]. A
model for decomposition of the data cube into a lattice was presented in [5]. Data cube
maintenance and refreshing issues were addressed in [7]. We have previously introduced
a time series database platform supporting active ECA rules [11].

The paper is organized as follows. In Section 2, we discuss IOLAP requirements.
In Section 3, we introduce the concept of accuracy-based consistency enforcement 
and propose the lazy aggregate method. In Section 4, we focus on a case study and the 
prototype implementation. We conclude in Section 5. 

2 IOLAP Model

Similarly to traditional OLAP models, the IOLAP data model involves base data and
aggregate data. The base data reflects the current and past state of a process and is
usually stored in a process database. An example of a process database management
system implementation is given in [11].

Calculable data cubes may be represented as nodes in a combined aggregate lattice 
[5]. The lattice has n+1 levels where n is the maximum number of dimensions. An 
example of a 4-level aggregate lattice is given in Figure 1.




Figure 1. The aggregate lattice 



The top level-3 ($l = n = 3$) aggregates are directly calculated from base data. They
may be expressed as a tuple <dimA, dimB, dimC, fact> where fact is a value of an
aggregate function over dimensions A, B, and C at a point (dimA, dimB, dimC) of a
3-dimensional space. We chose to represent both the base and aggregate data using the
relational model. For example, if the base data is stored in a relational table having the
schema (id, dimA, dimB, dimC, value), the level 3 node for the AVG function is
produced with the SQL query:

SELECT dimA, dimB, dimC, AVG(value)
FROM base_table
GROUP BY dimA, dimB, dimC;

The lower level aggregates can be calculated from precalculated upper-level
aggregates. We say that the aggregates at level $l$ are calculated from the source data at
level $l+1$ ($l = 0, \ldots, n-1$) and the level $n$ is aggregated from the base (fact) data. The
zero-level node is a single value aggregated over the whole base data set. The lattice can
also be called an N-cube, meaning a collection of N cubes where $N = 2^n$.
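
The structure of the lattice can be enumerated directly. The following Python sketch is an illustration only; the function names are hypothetical. It lists the group-by sets per level for three dimensions and, for a given node, the nodes one level up from which it can be aggregated.

```python
from itertools import combinations


def lattice_nodes(dimensions):
    """Return {level: [group-by sets]} for all subsets of the dimensions."""
    n = len(dimensions)
    return {
        level: [frozenset(c) for c in combinations(dimensions, level)]
        for level in range(n + 1)
    }


def sources(node, nodes):
    """Nodes one level up from which `node` can be aggregated."""
    upper = nodes.get(len(node) + 1, [])
    return [candidate for candidate in upper if node < candidate]


nodes = lattice_nodes(["dimA", "dimB", "dimC"])
total = sum(len(group) for group in nodes.values())
print(total)
```

For $n = 3$ the lattice has $n + 1 = 4$ levels and $2^3 = 8$ nodes in total. The single level-3 node has no source node in the lattice because it is computed from the base data.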

The lattice may be materialized to varying extents. In the two extreme cases, the
aggregate data may be computed entirely at query time or materialized entirely in
advance. There has been some research effort aimed at maintaining lattice
consistency in the most efficient way when the lattice is refreshed [5, 7].

We take a different approach in this work. In IOLAP, the goal is to be able to
perform real-time analysis using the latest process data. The temporal consistency of the
analysis data is expected to be within sub-second range. We thus assume the lattice is 
fully materialized at all times, and we strive to avoid "insignificant" recalculations, 
i.e. such aggregate recalculations that would not be required from the accuracy point 
of view. 

We propose a new consistency criterion that is based on numerical accuracy and 
leads to the concept of lazy aggregation. A user sets an allowable tolerance for an 
aggregate calculation error. Consequently, an aggregate value need not be recalcu- 
lated if the existing value represents the current process within the given tolerance. 



3 Lazy Aggregates 



3.1 Definitions 

Definition 1: Error band. Error band $\phi_v$ of variable $v$ is defined as a maximum
deviation of the measured value $v_i$ from a specific reference norm.

$\phi_v = \max_i ( abs( e(v_i) ) )$, (1)

where $i$ spans all measured occurrences of $v$, and $e(v_i)$ is an actual measurement error
of $v_i$. Error band is typically expressed as an absolute percentage of full scale, e.g.
10%, and it is normally associated with the measurement equipment.

Definition 2: Tolerance. Tolerance of a variable v is an acceptable degree of 
variation of v. It is also usually expressed as an absolute percentage of full scale. 

The above two concepts seem to be similar but there is a semantic difference between
them: the error band stems from the physical characteristics of the measuring
equipment, and the tolerance is an externally given requirement. Note that we expect the
measurement system to maintain, at any time, the relation

$\phi_v \le \alpha_v$ for every variable $v$. (2)

Sometimes the error band values are not known, and the tolerance values are used
instead. For simplicity of notation, we also assume that $\phi_v = \alpha_v$ for base data. For the
purpose of evaluation of the error band of the aggregate values, we introduce the
concept of the base aggregate error band.

Definition 3: Base aggregate error band (BEB). Base aggregate error band $BEB_a$ of
an aggregate $a$ is a maximum deviation of the aggregate value from a real exact value.
It is expressed as an absolute percentage of full scale. It is defined as a function:

$BEB_a = f(S, A)$, (3)

where $S = \{s_j \mid j = 1, \ldots, m\}$ is the set of available source values,
$A = \{\alpha_j \mid j = 1, \ldots, m\}$ is the set of the corresponding tolerance values, and $m$ is the
size of the source value set. There is a binary (correspondence) relation $R$ with
$\langle s_j, \alpha_j \rangle \in R$.

As we deal with the effect of propagating errors from function arguments to the func- 
tion result, various approaches to error propagation may be used in order to establish 
an appropriate BEB function for each aggregate. For example, if we treat measure- 
ment values as value intervals, the interval arithmetic [6] may be applied. 

BEB functions for some aggregates are trivial. For example, for the Maximum
aggregate:

$BEB_{max} = \alpha_j$ such that $MAX(S) = s_j$. (4)

In the case of the Average aggregate, the following approximation may be sufficient:

$BEB_{avg} = AVG(A)$. (5)
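
As an illustration of equations (4) and (5), the following minimal Python sketch computes the base error band for the Maximum and Average aggregates from paired source values and tolerances. The function names and the sample readings are assumptions made for this example only; they do not come from the prototype.

```python
from statistics import mean

def beb_max(sources, tolerances):
    """Equation (4): tolerance paired with the source value that attains MAX(S)."""
    # Index j of the largest source value; tolerances[j] is its paired tolerance.
    j = max(range(len(sources)), key=lambda k: sources[k])
    return tolerances[j]

def beb_avg(tolerances):
    """Equation (5): approximate the BEB of AVG by the average tolerance."""
    return mean(tolerances)

# Hypothetical readings with tolerances in % of full scale.
S = [41.0, 57.5, 49.2]
A = [2.0, 5.0, 3.0]
print(beb_max(S, A))  # tolerance paired with the maximum reading 57.5
print(beb_avg(A))
```

Keeping the two lists index-aligned is one simple way to represent the correspondence between source values and tolerances.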

We use BEB to characterize each lattice node element in terms of its maximum
accuracy, i.e. the accuracy achieved when the node element value (aggregate value) is
recalculated from source data having some tolerance. We now proceed to characterize
the error induced by not recalculating the aggregate.

We introduce the concept of an actual error band to reflect the instantaneous error 
band of an aggregate. It is recalculated each time there is a change to source data. 

Definition 4: Actual aggregate error band (AEB). The actual aggregate error band
$AEB_a$ of a node element is defined as

$AEB_a = BEB_a + \sum_i \mathrm{abs}(\delta_i)$, (6)

where $\delta_i$ is the error delta calculated for each $i$-th change (transaction)
affecting a source data variable:

$\delta_i = f_a(v_i, v\text{-}old_i, P_a)$, (7)

where $v_i$ and $v\text{-}old_i$ are the new and old values of the source data and $P_a$
is the power (cardinality) of the set of $v_i$ values being aggregated. The function
$f_a$ is aggregate-specific. The value of $i$ is set to zero each time the aggregate
value is recalculated. Evaluation of AEB is triggered by any replacement of
$v\text{-}old_i$ with $v_i$.

As a consistency criterion, we propose to maintain, in a lattice of aggregates, the 
following invariant: 

Definition 5: Tolerance invariant (consistency criterion). For each $j$-th element at
the $p$-th lattice node, the following holds:

$\forall p, j: \; AEB_{p,j} \le \alpha_p$, (8)

where $\alpha_p$ is the tolerance value associated with the aggregate at node $p$.

3.2 Computational Model 

The idea of lazy aggregates is to delay the aggregate recalculation until the actual
error band of the aggregate value exceeds the given tolerance. We propose to maintain
the tolerance invariant by way of a standard ECA rule (trigger) mechanism [9]. An
ECA rule is associated with each aggregate at levels 1, ..., n. The aggregate
recalculation at any level takes place in the following steps.

1. Event detection: a change of data source at level l is detected.

2. Condition evaluation: the tolerance invariant at level l-1 is checked.

3. Action execution (conditional): the aggregate at level l-1 is recalculated.

As a result, the change in the base data is propagated down the lattice in an
optimized way. A possible scenario of aggregate recalculation is shown in Figure 2.
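
The event, condition and action steps can be sketched for a single AVG aggregate as follows. This is a minimal illustration, not the prototype's code: the class and method names are hypothetical, BEB uses the approximation of equation (5), and the error delta is assumed to be abs(new - old) / P, since the text only states that this function is aggregate-specific.

```python
class LazyAverage:
    """One lattice node element holding a lazily maintained AVG."""

    def __init__(self, values, tolerances, node_tolerance):
        self.values = list(values)            # source values, % of full scale
        self.tolerances = list(tolerances)    # tolerance of each source value
        self.node_tolerance = node_tolerance  # tolerance of this node
        self.recalculations = 0               # counts the initial calculation too
        self._recalculate()

    def _recalculate(self):
        # Action: recompute the aggregate, re-evaluate BEB, reset the deltas.
        self.value = sum(self.values) / len(self.values)
        self.beb = sum(self.tolerances) / len(self.tolerances)
        self.delta_sum = 0.0
        self.recalculations += 1

    @property
    def aeb(self):
        # Definition 4: BEB plus the accumulated absolute error deltas.
        return self.beb + self.delta_sum

    def on_change(self, index, new_value):
        # Event: a source value is replaced by a new measurement.
        old_value = self.values[index]
        self.values[index] = new_value
        # Assumed delta for AVG: shift of the mean caused by this one change.
        self.delta_sum += abs(new_value - old_value) / len(self.values)
        # Condition: recalculate only once AEB exceeds the node tolerance.
        if self.aeb > self.node_tolerance:
            self._recalculate()
            return True
        return False


node = LazyAverage([50.0, 50.0, 50.0, 50.0], [2.0] * 4, node_tolerance=5.0)
print(node.on_change(0, 54.0))  # False: AEB is 3.0
print(node.on_change(1, 56.0))  # False: AEB is 4.5
print(node.on_change(2, 53.0))  # True: AEB reached 5.25, value recalculated
print(node.value)               # 53.25
```

Between recalculations the stored value is stale but stays within the node tolerance, which is the trade-off the method relies on.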



[Plot: accuracy (%) against time, showing the 10% tolerance level, BEB(t1), BEB(t4), and the time points t3, t4 and t5.]



Figure 2. An example of lazy aggregate computation 

In the example, at time t1, the aggregated power consumption is calculated and the
base error band (BEB) is evaluated. At t2 and t3, new measurement values arrive
and the evaluation of the tolerance invariant is triggered. The corresponding error
deltas are calculated, but no aggregate recalculation is triggered as the resulting actual
error band (AEB) is still within the tolerance. At t4, the accumulated AEB exceeds
the tolerance value, which leads to aggregate recalculation and re-evaluation of BEB.
At t5, the tolerance invariant holds again, and no aggregate recalculation is needed.

4 Implementation and Results

4.1 Case Study 

ABB Industry Oy is a leading manufacturer of electric drive systems for heavy indus- 
try. A typical product is a paper machine drive system [2]. It consists of a few tens of
high-power electric motors (drives), together with the associated frequency convert- 
ers, and a control and supervision system. In the case study, the requirement is to be 
able to survey the overall drive operation in real-time. Drives are combined into drive 
sections which, in turn, are grouped according to machine parts such as wire, press 
and dryer. Several paper machines may be surveyed at a factory or at different loca- 
tions, at the same time. Possible dimensions to be used in motor behavior analysis are 
thus: section, machine part, machine, location, power range, type and manufacturing 
year. Some of the dimensions may be considered different granularities of a single 
dimension. For example, machine part, machine and location may be granularities of 
the geographical dimension. 

The variables measured at each motor are, typically, temperature, power and
torque. The measured values constitute the base data of the case study IOLAP model.
The data is collected at a typical rate of one measurement record (a tuple of all
measurement values) per second per motor. For a 100-motor installation, an update rate
of 100/s is attained for any variable type. If we required that the lattice node values be
recalculated each time a source value changes, the required aggregate recalculation
rate would be $(100 \cdot 2^4)$/s = 1600/s. Such a rate would not be feasible on
low-cost PC equipment. We apply the lazy aggregation method to reduce the recalculation
rate significantly.
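
The arithmetic behind the 1600/s figure can be spelled out in a few lines of Python. The variable names are illustrative; the count of 16 lattice nodes (four dimensions) is what the stated result implies.

```python
motors = 100            # motors in the installation
updates_per_motor = 1   # measurement records per second per motor
dimensions = 4          # dimensions of the data cube

lattice_nodes = 2 ** dimensions             # 16 group-by combinations
update_rate = motors * updates_per_motor    # 100 source updates per second
eager_rate = update_rate * lattice_nodes    # recalculations per second, no laziness
print(eager_rate)  # 1600
```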

4.2 Prototype Implementation 

We have implemented a prototype of a general-purpose N-cube server called Ruble 
[10]. Ruble is based on the existing RapidBase active time series database system 
[12]. The aggregates in the data cube lattice are organized as relational database ta- 
bles. Triggers are associated with all time series and aggregates. 

All the functionality for checking the tolerance invariant and computing the error
band values is implemented in a detached Ruble Aggregate Engine process. When
any value in a time series or an aggregate node is changed, an appropriate trigger is
fired and the aggregate engine handles the action. The action execution starts with
checking the tolerance invariant and, depending on the result, either the actual error
band is updated or the lattice node value is recalculated.

The analysis data is generated with a process data generator, which periodically
feeds new measurement values for each motor. The aggregated results are analyzed
using a general-purpose reporting tool connected to RapidBase via the ODBC
driver.

4.3 Performance Results

The test scenario consists of 100 motors and the power of each motor is measured. 
The measurements are assumed to have a random walk behavior and each motor is 
handled independently. 

The motors are analyzed on the basis of four dimensions (type, power range, year
of manufacturing, and machine part). Thus, the aggregate lattice used contains
$2^4 = 16$ tables, which are all materialized. The experiment is performed separately
for two aggregate functions, AVG(power) and SUM(power). These functions were selected
to represent the algebraic and distributive aggregate function groups [4], respectively.
We do not address holistic aggregate functions in our experiment. The arrival rate of
the measurement values for each motor is 1/s. The tolerance values are varied within a
range of 2% to 20%.

During the test run, the number of lattice recalculations is measured. A lattice
recalculation is performed when an updated error band exceeds the given tolerance.
The result is expressed as a lattice recalculation percentage that is calculated with the
following formula (for a given period of time):

RECALC% = no._of_lattice_element_recalculations / ([...] * no._of_input_transactions) (9)

The test was run under Windows NT 4 on a 333 MHz Pentium Pro PC with 128 MB of
main memory. The experimental results are presented in Figure 3.




The experimental results begin with a tolerance value of 2%. Lower values were
not attainable due to performance limitations. It can be clearly seen that, by using
an affordable tolerance level, say 5%, it is possible to reduce the number of lattice
recalculations drastically. Furthermore, the performance is similar regardless of the
aggregate function type. However, neither function is demanding in terms of complexity,
nor are their BEB and AEB functions. More complex algebraic aggregate functions may
cause indeterminate performance degradation when using the lazy aggregate method.

5 Conclusions

We discussed the requirements for using OLAP in industrial analysis, and we found
the OLAP concept useful for the purpose. However, the traditional OLAP approach
must be enhanced to fulfill industrial analysis needs. Typical industrial data is
time-based and has a fast arrival rate. The accuracy of the data at various levels of the
OLAP structure (data cube) varies in time as the underlying base data changes. We
took advantage of this phenomenon to utilize the accuracy of computations in a
performance optimization scheme called lazy aggregates. Lazy aggregates are
aggregates that are recalculated only if the accuracy of the previously calculated values
is not within the given tolerance. We defined the concept of the error band and a
consistency criterion in the form of the tolerance invariant. We proposed how the lazy
aggregate method can be implemented using an active main memory database. We
provided a case study based on a paper industry application and a prototype
implementation of a general-purpose N-cube server called Ruble. We analyzed our
lazy aggregate approach using Ruble and found that, by using the lazy aggregate
concept, it is possible to perform complex industrial analysis on a standard PC
platform in the presence of fast data acquisition.

Abstract. A first attempt to extract association rules from a database
frequently yields a significant number of rules, which may be rather difficult
for the user to browse when searching for interesting information. However,
powerful languages allow the user to specify complex mining queries to 
reduce the amount of extracted information. Hence, a suitable rule set 
may be obtained by means of a progressive refinement of the initial query. 
To assist the user in the refinement process, we identify several types of 
containment relationships between mining queries that may lead the pro- 
cess. Since the repeated extraction of a large rule set is computationally 
expensive, we propose an algorithm to perform an incremental recompu- 
tation of the output rule set. This algorithm is based on the detection of 
containment relationships between mining queries. 



1 Introduction 

Association rules allow the detection of the most common links among data 
items in a large amount of collected data. The number of extracted association 
rules may be huge, and not all the discovered associations may be meaningful for 
a given user. Several high-level languages have been proposed [3, 5] to allow the
user to specify accurate extraction criteria, according to which the association rules
are searched for in the source data. These languages provide the user with a powerful
instrument for both the specification and the refinement of mining queries, in
order to better capture the user's needs.

The design of an association rule mining query is a complex task in which 
we can distinguish two phases: 

- Identification and definition: given a data source and user requirements, the 
correct class(es) of extraction criteria for the problem are identified. Next, the 
appropriate mining query is defined. We do not expect the initial output rule
set to completely satisfy the user requirements.

- Refinement: the initial query is iteratively refined in order to better meet 
(possibly changing) user needs. 

The identification and definition phase is extensively discussed in [2], where
a correspondence between classes of extraction criteria and the corresponding
mining queries is presented. In this paper we analyze the refinement phase.
Given a mining query, we study the effect of modifications performed in order to
better capture the user's needs, either because the previous formulation did not
completely match them, or because they evolved over time.

We first identify a containment relationship between mining queries, called
dominance. This relationship aims to identify a containment relationship between
the output rule sets extracted by the compared queries. Next, we identify
classes of refinement criteria and we characterize the relationship between
progressively more complex queries inside each class by means of the above
properties. When the query refinement satisfies the identified relationship, the actual
rule extraction process may be significantly simplified. In this case, a rule set
$\mathcal{R}_2$ extracted by a query $M_2$, which refines a query $M_1$ that produced
rule set $\mathcal{R}_1$, can be incrementally obtained starting from the rules in
$\mathcal{R}_1$.

In related work, when the extracted rule set is too large, the most commonly 
proposed technique to help the user in finding relevant rules is a powerful rule 
browsing tool [3, 4]. This solution allows the user to specify only the simplest of 
the refinement criteria we discuss in this paper. In [4] the problem of selecting 
interesting association rules from a large set of discovered rules is addressed in 
a twofold way. Firstly, rule templates describing the structure of the rules the 
user is interested in allow the selection of a subset of the formerly discovered 
rules. Secondly, a graphical tool is proposed, which shows rules in a clever way, 
in order to highlight common parts. In [3], the notion of exploratory mining is
introduced. Exploratory mining allows the user to extract all possible rules on
all possible attributes, in order to help the user find interesting rules. However,
only a purely practical approach to the problem is presented. An evolution
of the concept of exploratory mining is provided in [6]: a language to specify
constrained association rule queries is defined in such a way that it allows pruning
optimizations to be performed during the actual extraction of association rules.

The paper is organized as follows. Section 1.1 describes the association rule 
mining operator MINE RULE [5], on which our discussion is based. Section 2 de- 
fines the comparison relationships among mining queries. In Section 3 the most 
relevant cases of query refinement are discussed, while in Section 4 an algorithm 
for the incremental computation of refined queries is presented. Finally, Section 5 
discusses conclusions and future work. 

1.1 Mining Queries

Mining queries for extracting association rules from relational data can be ex- 
pressed by means of a SQL-like operator, named MINE RULE, defined in [5]. This 
section introduces the operator by showing its application to a running example, 
i.e. the Purchase table depicted in Figure 1, containing data about purchases 
in a store. Each purchase transaction is characterized by a unique identifier, a 
customer code and a date. For each bought item, the price and the purchased 
quantity are specified. 

tr.  customer  item          date      price  q.ty
1    cust1     col_shirts    12/17/96     85     1
1    cust1     hiking_boots  12/17/96    180     1
1    cust1     jackets       12/17/96    102     1
2    cust?     brown_boots   12/18/96    150     1
2    cust?     col_shirts    12/18/96     25     3
3    cust?     col_shirts    12/19/96     99     2
3    cust?     brown_boots   12/19/96    150     3
4    cust?     jackets       12/20/96     50     3
5    cust?     col_shirts    12/21/96     99     1

Fig. 1. The Purchase table for a big-store.

Consider customers that bought fewer than 10 items. Suppose the user is
interested in extracting rules that associate a set of items (the body) with a single
item (the head) bought by a single customer on the same date. In particular, the
user wants items with price greater than or equal to $100 in the body, and items
with price less than $100 in the head. Rules are interesting only if their support
(the frequency of a rule among customers) is at least 0.2 (i.e., at least 20%
of customers support the rule) and their confidence (the conditional probability
that a customer buying the items in the body also buys the items in the head) is at
least 0.3 (i.e., at least 30% of customers supporting the rule body also support the
whole rule). The following query corresponds to the above problem.

MINE RULE FilteredSameDate AS
SELECT DISTINCT 1..n item AS BODY, 1..1 item AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.price >= 100 AND HEAD.price < 100
FROM Purchase
GROUP BY customer HAVING COUNT(*) < 10
CLUSTER BY date HAVING BODY.date = HEAD.date
EXTRACTING RULES WITH SUPPORT: 0.2, CONFIDENCE: 0.3

The association rules are extracted by performing the following steps. 

Group computation. The source data, contained in the Purchase table (FROM
clause), is logically partitioned into groups (GROUP BY clause) such that all
tuples in a group have the same value of attribute customer (in general, several
attributes are allowed in this clause; they are called grouping attributes). The
total number of groups in the source data is computed in this step; we denote it
as $G$ (used to compute rule support).

Group filtering. Only groups with fewer than 10 tuples are considered for
rule extraction; this is expressed by the group filtering condition in the HAVING
clause associated with the GROUP BY clause.

Cluster identification. Each group is further partitioned into sub-groups
called clusters (see Figure 2), such that tuples in a cluster have the same value
for attribute date (attributes specified in this clause are called clustering
attributes). The body (resp. head) of a rule is extracted from clusters, thus
elements in the body (resp. head) share the same value of the clustering attribute;
in the absence of clusters (the CLUSTER BY clause is optional), rules are extracted
from the trivial cluster, i.e. the entire group.

Cluster coupling. In order to compose rules, every pair of clusters (one for the
body and one for the head) inside the same group is considered. Furthermore, the
cluster filtering condition in the HAVING clause of the CLUSTER BY clause selects
the cluster pairs that should be considered for extracting rules. In this case, a
pair of clusters is considered only if the date of the left-hand cluster (called body
cluster) is the same as the date of the right-hand cluster (called head cluster).
In the absence of cluster coupling (optional), all pairs of clusters are valid.

Mining Condition. Before rule extraction is performed, the tuple predicate 
in the SELECT ... WHERE clause is evaluated: given a pair of clusters and all
tuples in the clusters, a rule is extracted only if tuples considered for body and 
head satisfy the predicate. Since this predicate is evaluated during the actual 
rule extraction phase, it is called mining condition. In our example, the mining 
condition selects only items with price greater than or equal to $100 for the 
bodies, and only items with price less than $100 for the heads. This clause is 
optional, and in its absence every tuple combination is valid. 

Rule extraction. From each group, all possible associations of an unlimited
set of items (clause 1..n item AS BODY), which is called the premise or body
of the rule, with a single item (clause 1..1 item AS HEAD), which is called the
consequent or head of the rule, are extracted. An example of a rule generated by
the query is {brown_boots, jackets} ⇒ {col_shirts}, where elements in the body and
in the head are items.

The number of groups that contain a rule $r$, denoted as $G_r$ (used to compute
support and confidence of rules), and the number of groups that contain the
body of $r$, denoted as $G_b$ (used to compute confidence of rules), are computed
during this step.

Observe that, for a rule $r$, $G_r$ denotes the total number of groups, after the
application of the group filtering condition, which contain a pair of clusters from
which the rule is extracted, while $G_b$ gives the same information for the body.

Support and confidence evaluation. The support of a rule is its frequency
among groups, $s = G_r / G$; the confidence is the conditional probability that the
rule is found in a group which contains the body, $c = G_r / G_b$. Support and
confidence are thus a measure of the relevance of a rule. If they are lower than their
respective minimum thresholds (0.2 for support and 0.3 for confidence, in our
sample query), the rule is discarded.
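
The support and confidence computation can be illustrated with a small Python sketch. It is a simplification under stated assumptions: groups are given directly as lists of clusters (sets of items), each cluster is coupled only with itself (mirroring a same-date condition), and group filtering and the mining condition are left out. The function name and the toy data are hypothetical and are not the Purchase table of Figure 1.

```python
from fractions import Fraction

def support_confidence(groups, body, head):
    """Return (s, c) for the rule body => head.

    groups maps a group key to its list of clusters; each cluster is a set
    of items. A rule is found in a group if one cluster holds body and head.
    """
    body, head = set(body), set(head)
    G = len(groups)
    # G_r: groups with a cluster that contains the whole rule.
    G_r = sum(any(body | head <= c for c in clusters)
              for clusters in groups.values())
    # G_b: groups with a cluster that contains the body.
    G_b = sum(any(body <= c for c in clusters)
              for clusters in groups.values())
    s = Fraction(G_r, G)
    c = Fraction(G_r, G_b) if G_b else Fraction(0)
    return s, c

# Toy data: three customers, clusters are the item sets bought on one date.
purchases = {
    "cust_a": [{"boots", "shirt", "jacket"}],
    "cust_b": [{"boots", "shirt"}, {"jacket"}],
    "cust_c": [{"jacket"}],
}
print(support_confidence(purchases, {"jacket"}, {"shirt"}))  # s = 1/3, c = 1/3
print(support_confidence(purchases, {"boots"}, {"shirt"}))   # s = 2/3, c = 1
```

With thresholds of 0.2 for support and 0.3 for confidence, both toy rules would be kept.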

The resulting rule set is the following. 

{hiking_boots} ⇒ {col_shirts}, s = 0.33, c = 1
{jackets} ⇒ {col_shirts}, s = 0.33, c = 0.5
{hiking_boots, jackets} ⇒ {col_shirts}, s = 0.33, c = 1
{brown_boots} ⇒ {col_shirts}, s = 0.66, c = 1

We finally observe that, in the absence of clusters, if body and head are not
disjoint, rules are tautological (e.g., rule {brown_boots, col_shirts} ⇒ {col_shirts}
is tautological). Since the association is obvious, tautological rules are not
extracted. When clusters are specified instead, a rule whose body and head are
not disjoint is tautological only if body and head are extracted from the same
cluster, since the association is obvious only if body and head refer to the same
value of the clustering attributes. Thus, for a potentially tautological rule $r$,
$G_r$ indicates the number of groups, after applying the group filtering condition,
which contain at least one pair of distinct clusters from which $r$ is extracted.

2 Properties of Mining Queries 

Mining queries that extract association rules from a database can be compiled 
into a set of rather complex queries that extract the relevant information from 
the database. In this setting, it is rather difficult to apply known techniques to 
detect interesting relationships (e.g., query containment) between queries. 

Owing to the complex structure of mining queries, it is possible to define
several different relationships between mining query pairs. In particular, in
addition to the classical notion of equivalence, two containment relationships,
inclusion and dominance, have been identified, which differ in the way support and
confidence values for the rules are considered.

The above relationships are formally defined, and their properties described,
in Section 2.1. These relationships can be used to drive
the refinement of the rule extraction process (see Section 3).



2.1 Relationships between Rule Sets

Consider two MINE RULE queries $M_1$ and $M_2$, which extract rule sets $\mathcal{R}_1$
and $\mathcal{R}_2$ respectively. To define the relationship between $M_1$ and $M_2$,
we must consider both the sets of rules output by $M_1$ and $M_2$ and their support
and confidence.

Intuitively, two mining queries $M_1$ and $M_2$ are equivalent if their output rule
sets $\mathcal{R}_1$ and $\mathcal{R}_2$ always contain the same rules and each rule has
the same value for support and confidence in both $\mathcal{R}_1$ and $\mathcal{R}_2$.

Definition 1 (Equivalence): Let $M_1$ and $M_2$ be two mining queries, extracting
from the same source data rule sets $\mathcal{R}_1$ and $\mathcal{R}_2$ respectively.
$M_1$ and $M_2$ are equivalent ($M_1 \equiv M_2$) if, for all instances of the source
data, each rule $r$ in $\mathcal{R}_1$ is also in $\mathcal{R}_2$ and vice versa, with
the same support and confidence. □

A mining query $M_1$ includes query $M_2$ when its output rule set $\mathcal{R}_1$
includes $\mathcal{R}_2$, output by $M_2$, and each common rule has the same value of
support and confidence in both sets. Hence, the relevance of common rules w.r.t. the
source data is the same.

Definition 2 (Inclusion): Let $M_1$ and $M_2$ be two mining queries that extract
from the same source data rule sets $\mathcal{R}_1$ and $\mathcal{R}_2$ respectively.
Query $M_1$ includes $M_2$ (written as $M_1 \supseteq M_2$) if, for all instances of
the source data, each rule $r$ in $\mathcal{R}_2$ is also in $\mathcal{R}_1$, with the
same value of support and confidence. □

Dominance is a weaker kind of inclusion, in which support and confidence of
each rule $r$ in $\mathcal{R}_2$ may be lower than the values for the corresponding rule
in $\mathcal{R}_1$.

Definition 3 (Dominance): Let $M_1$ and $M_2$ be two mining queries that
extract from the same source data rule sets $\mathcal{R}_1$ and $\mathcal{R}_2$
respectively, and let $s_1, c_1$ (respectively $s_2, c_2$) be the support and confidence
of a rule $r$ in $\mathcal{R}_1$ ($\mathcal{R}_2$). Query $M_1$ dominates $M_2$
(written as $M_1 \rhd M_2$) if, for all instances of the source data, each rule $r$ in
$\mathcal{R}_2$ is also in $\mathcal{R}_1$ and is characterized by $s_2 \le s_1$ and
$c_2 \le c_1$. □

Theorem 1: Equivalence is a particular case of inclusion; inclusion is a
particular case of dominance. □

Equivalence, inclusion, and dominance meet the transitivity property. 

For the sake of brevity, proofs of theorems in this section, as well as theorems
in the rest of the paper, are not reported here; they are extensively reported
in [7].
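
To make the three relationships concrete, the following Python sketch compares two rule sets extracted from one instance of the source data. It is only an illustration: the definitions quantify over all instances, so a check on a single instance can refute a relationship but cannot establish it. The representation (a dict from rule to a (support, confidence) pair) and the function names are assumptions of this example, and dominance is read as "not higher than", so that inclusion is a special case as Theorem 1 requires.

```python
def includes(R1, R2):
    """Every rule of R2 is in R1 with identical support and confidence."""
    return all(rule in R1 and R1[rule] == sc for rule, sc in R2.items())

def dominates(R1, R2):
    """Every rule of R2 is in R1 with support and confidence not higher."""
    return all(
        rule in R1 and sc[0] <= R1[rule][0] and sc[1] <= R1[rule][1]
        for rule, sc in R2.items()
    )

def equivalent(R1, R2):
    """Both rule sets hold the same rules with the same values."""
    return includes(R1, R2) and includes(R2, R1)

# Hypothetical rule sets; a rule is a (body, head) pair.
rule_a = (frozenset({"brown_boots"}), "col_shirts")
rule_b = (frozenset({"jackets"}), "col_shirts")
R1 = {rule_a: (0.66, 1.0), rule_b: (0.33, 0.5)}
R2 = {rule_a: (0.66, 1.0)}   # subset with unchanged values
R3 = {rule_a: (0.33, 0.8)}   # same rule with lower values

print(includes(R1, R2), dominates(R1, R2))  # True True
print(includes(R1, R3), dominates(R1, R3))  # False True
```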

3 Refinement of Mining Queries 

The first attempt to extract rules from a data source may return a huge amount 
of data. In this case, the user may either browse the extracted rules looking for 
relevant information, or refine the mining query to reduce the amount of returned 
information. This second option seems more promising, because it allows the user 
to progressively restrict the scope of her/his request by specifying new selection 
criteria, that usually cannot be applied if rules are simply browsed. 

In [2] we identified several classes of relevant association rules based on the
adopted extraction criterion. The minimal specification for rule extraction
requires the indication of the basic rule features, i.e., attribute(s) from which the
rules are extracted and body and head cardinality, group(s) from which rules
are extracted, and minimum thresholds for support and confidence. Furthermore,
the following orthogonal extraction criteria (which correspond to optional
clauses of the MINE RULE operator) may be specified:

- Mining conditions, which allow the specification of filtering conditions specifi- 
cally on head and/or body of the rules to be extracted. 

- Clustering conditions, which allow the partitioning of groups into subgroups 
(clusters) with common features, which are then coupled and on which further 
conditions may be specified. 

- Group filtering conditions, which allow the selection of a subset of groups on
which mining is performed.

The above criteria may guide the user in refining an initial mining query,
by adding progressively more restrictive conditions. In the following sections,
given a mining query $M_1$ and a refinement $M_2$ of it, we analyze the relationships
that hold between $M_1$ and $M_2$. In particular, in Section 3.1 we discuss
modifications of the basic features of association rules, while in the following sections
we separately explore the effects of mining conditions and clustering conditions.
For brevity, we do not consider group filtering conditions, because the
dominance relationship does not hold for them [7].

3.1 Rule Features 

In this section, we discuss the effect of modifying the basic characteristics of 
association rules, i.e., support and confidence thresholds, body and head cardi- 
nalities, and specification of rule attributes. 

Support and Confidence Thresholds. The minimum thresholds for support 
and confidence determine the number (and relevance) of extracted rules. These 
parameters are usually calibrated by the user for each particular application. 
Indeed, excessively high minimum thresholds may significantly reduce the num- 
ber of extracted rules, causing important information to be lost. In contrast, 
excessively low minimum thresholds yield a huge number of rules, most of which
are not meaningful since they are not sufficiently frequent.

Theorem 2: Let $M_1$ and $M_2$ be two mining queries, identical apart from the
minimum support (resp. $s_1$ and $s_2$) and minimum confidence (resp. $c_1$ and $c_2$)
thresholds. If $s_1 \le s_2$ and $c_1 \le c_2$, then $M_1 \supseteq M_2$. □

Body and Head Cardinalities. The choice of the appropriate minimum and
maximum cardinality for body and head may be the result of a refinement
process. An inclusion relationship holds, as stated by the following theorem. An
identical result holds for the cardinality of the head.

Theorem 3: Let $M_1$ and $M_2$ be two mining queries, identical apart from the
cardinality of the body, whose minimum and maximum values are $b_1$ and $B_1$,
and $b_2$ and $B_2$ respectively. If $b_1 \le b_2$ and $B_1 \ge B_2$, then
$M_1 \supseteq M_2$. □
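
Because inclusion leaves support and confidence unchanged, a refinement of the kind covered by Theorems 2 and 3 can be answered by filtering the rule set that was already extracted. The Python sketch below illustrates this under assumptions: the function name and the dict representation are hypothetical, and the sample values are the four rules listed for the FilteredSameDate query in Section 1.1.

```python
def refine(rules, min_support=0.0, min_confidence=0.0,
           min_body=1, max_body=None):
    """Filter an extracted rule set without rescanning the source data."""
    kept = {}
    for (body, head), (s, c) in rules.items():
        if s < min_support or c < min_confidence:
            continue  # Theorem 2: raised thresholds only remove rules
        if len(body) < min_body:
            continue  # Theorem 3: narrower body cardinality range
        if max_body is not None and len(body) > max_body:
            continue
        kept[(body, head)] = (s, c)
    return kept

R1 = {
    (frozenset({"hiking_boots"}), "col_shirts"): (0.33, 1.0),
    (frozenset({"jackets"}), "col_shirts"): (0.33, 0.5),
    (frozenset({"hiking_boots", "jackets"}), "col_shirts"): (0.33, 1.0),
    (frozenset({"brown_boots"}), "col_shirts"): (0.66, 1.0),
}
print(len(refine(R1, min_support=0.2, min_confidence=0.6)))  # 3
print(len(refine(R1, max_body=1)))                           # 3
print(len(refine(R1, min_support=0.5)))                      # 1
```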

3.2 Mining Condition 

The mining condition allows the user to impose filtering conditions either sep- 
arately on the rule body and/or head, or correlating them. The absence of a 
mining condition is interpreted as an always true condition. Hence, both the 
case in which a mining condition is added, and the case in which a mining con- 
dition is refined, can be treated in the same way. The following theorem states 
that there is a dominance relationship between queries when implication holds 
between their respective mining conditions. 

Theorem 4: Let $M_1$ and $M_2$ be two mining queries, identical apart from the
respective mining conditions $m_1$ and $m_2$. If $m_2 \Rightarrow m_1$¹, then
$M_1 \rhd M_2$. □

¹ We denote predicate implication with $\Rightarrow$.

In the following particular case, the inclusion relationship holds.

Corollary 1: Let $M_1$ and $M_2$ be two mining queries, identical apart from the
respective mining conditions, denoted as $m_1$ and $m_2$ respectively. Consider
$m_2 = m_1 \wedge p$. If $p$ is a simple comparison predicate between a rule attribute
and a constant or between two rule attributes, then $M_1 \supseteq M_2$. □



3.3 Clusters 

Clusters are further partitions of groups based on the values of the cluster- 
ing attribute(s). When a clustering attribute is specified, rules are extracted 
from couples of clusters inside the same group. Further conditions may be spec- 
ified on the clustering attributes by means of the HAVING clause. The query 
FilteredSameDate in Section 1.1 is an example of clustering on the attribute 
date with a cluster filtering condition. In the following, we discuss the effect 
caused by the introduction of clustering. 

Adding Clusters. Consider two mining queries $M_1$ and $M_2$, where $M_2$ differs
from $M_1$ only by the addition of a clustering attribute. In $M_1$ only the trivial
cluster, which coincides with the group itself, is considered, and rules are extracted
from pairs of groups. In $M_2$ instead, clusters introduce a further partitioning
of groups. Rules are now extracted from the pairs of clusters identified in each
group. Hence, given the same source data, the generated rule sets may be
significantly different. Two opposite effects may be caused by the addition of a
cluster. New rules can be generated: rules that are discarded as tautological for
$M_1$ because body and head are not disjoint are no longer tautological for
$M_2$ if body and head are extracted from different clusters. In contrast, rules may
disappear from the output rule set: since each cluster in a group is smaller
than the group itself, the same rule $r$ extracted by $M_2$ may have lower values
for support and confidence than when extracted by $M_1$. Then, if $r$ no longer
meets the support or confidence threshold, it is not extracted. Hence, we can
conclude that the addition of a clustering attribute leads to a query $M_2$ which is
incomparable with $M_1$.

Cluster Filtering Condition. The cluster filtering condition selects the cluster
pairs inside groups from which rules are extracted. Its absence can be viewed
as the true predicate. Consider two mining queries $M_1$ and $M_2$, with clustering
conditions $h_1$ and $h_2$, respectively. The following theorem shows that $M_1$
dominates $M_2$ if $h_2$ is more restrictive than $h_1$.

Theorem 5: Let $M_1$ and $M_2$ be two clustered mining queries, identical apart
from the respective clustering conditions, denoted as $h_1$ and $h_2$ respectively. If
$h_2 \Rightarrow h_1$, then $M_1 \rhd M_2$. □

4 Incremental Computation 



The query refinement process may cause repeated extractions of association rules
with slightly different features. Since, as shown in several works [1], the rule
extraction process is typically a computationally expensive task that requires
strongly optimized extraction techniques, when a query is refined it becomes
important to perform this task incrementally from a given rule set.

Input: a source relation r, the rule set I.
Output: the dominated rule set R.

begin
  for each rule r in I do
    r.rule_groups := 0; r.body_groups := 0; insert r into R;
  end

  for each valid group g in r do
    compute the set Cg of clusters cj in g;
    compute the set Pg of valid cluster pairs pk in g;
    for each rule r in R do
      if r is in some pk ∈ Pg then r.rule_groups := r.rule_groups + 1;
      if the body of r is in some cj ∈ Cg then r.body_groups := r.body_groups + 1;
    end
  end

  for each rule r in R do   // Rule Selection Phase
    r.support := r.rule_groups / AllGroups;
    r.confidence := r.rule_groups / r.body_groups;
    if r.support < min_support or r.confidence < min_confidence then
      discard r from R;
  end

  return R;
end.

Fig. 3. Algorithm for incremental computation in case of dominance.

In the general case, incremental recomputation of a new rule set is not
possible, but detecting the properties of inclusion and dominance allows us to
simplify the extraction process significantly. In particular:

Inclusion: given two mining queries such that M1 ⊇ M2, the second rule set can
be obtained from the first rule set without scanning the source data again.
Dominance: given two mining queries such that M1 > M2, the second rule set can
be obtained from the first one by means of a single, simple pass over the source
data, with very low space complexity.

Algorithm. The dominance relationship M1 > M2 is characterized by the fact
that rules appearing in both output rule sets have lower support and confidence
in the second rule set than in the first. Consequently, it is necessary to recompute
support and confidence, but only for the rules contained in the first rule set.

As a consequence, it is not necessary to perform again the complete rule 
extraction process: a single pass over the source data is sufficient, and can be 
performed by a program implementing the algorithm of Figure 3. 
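
The single pass of Figure 3 can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions that are not taken from the paper: each group is given as a list of clusters, each cluster is a frozenset of items, a rule is a (body, head) pair of frozensets, a rule "is in" a cluster pair when its body is contained in the first cluster and its head in the second, and the hypothetical predicate pair_ok stands in for the cluster filtering condition.

```python
from itertools import permutations


def recompute_dominated(groups, rules, min_support, min_confidence,
                        pair_ok=lambda a, b: True):
    """Recompute support and confidence for `rules` in one pass over `groups`.

    groups: list of groups, each a list of clusters (frozensets of items).
    rules:  iterable of (body, head) pairs of frozensets (the first rule set).
    pair_ok: hypothetical stand-in for the cluster filtering condition.
    """
    # One pair of counters per rule: [rule_groups, body_groups].
    counts = {rule: [0, 0] for rule in rules}

    for clusters in groups:
        # Valid cluster pairs of this group.
        pairs = [(a, b) for a, b in permutations(clusters, 2) if pair_ok(a, b)]
        for (body, head), c in counts.items():
            if any(body <= a and head <= b for a, b in pairs):
                c[0] += 1  # the whole rule occurs in this group
            if any(body <= a for a in clusters):
                c[1] += 1  # the body occurs in this group

    # Rule selection phase: keep only rules that still meet both thresholds.
    kept = {}
    for rule, (rule_groups, body_groups) in counts.items():
        if body_groups == 0:
            continue
        support = rule_groups / len(groups)
        confidence = rule_groups / body_groups
        if support >= min_support and confidence >= min_confidence:
            kept[rule] = (support, confidence)
    return kept
```

Only the two counters per rule are kept in memory, which is what makes the space requirement depend on the size of the first rule set rather than on the source data.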

The space complexity of the algorithm can be computed as follows: the
first rule set contains m rules, and the body or head of each rule contains at
most n elements. In order to store all rules, memory for (n + n) × m elements is
required. Thus, the space complexity is O(n × m). Observe that this is a
substantial improvement over the exponential complexity of the general
extraction algorithms.



5 Conclusions and Future Work 

This paper addresses the problem of incrementally refining association rule
mining. Since several complex languages for the specification of mining queries
are now becoming available, it is very important to allow the user to
progressively refine the initial mining query with increasingly complex criteria.

We initially define the relationships of equivalence, inclusion and dominance
between two mining queries M1 and M2. An extensive analysis of the application
of these comparison relationships to mining query pairs M1 and M2 allows us to
identify several cases of query refinement in which some of the above relationships
hold. Furthermore, we show that the rule set R2 output by M2 can be obtained
by means of an incremental recomputation technique based on the knowledge of
rule set R1 output by M1.

We finally observe that, although our mining queries are expressed by means 
of the MINE RULE operator [5], the proposed techniques can be applied to any 
language providing the same expressive power. Furthermore, we believe that 
incremental recomputation algorithms can exploit some intermediate results of 
the extraction process. We are currently exploring the benefits of this knowledge 
in the specific context of the MINE RULE prototype. 



References 

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large
databases. In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.

2. E. Baralis and G. Psaila. Designing templates for mining association rules.
Journal of Intelligent Information Systems, 9:7-32, 1997.

3. T. Imielinski, A. Virmani, and A. Abdulghani. Datamine: Application programming
interface and query language for database mining. KDD-96, 1996.

4. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo.
Finding interesting rules from large sets of discovered association rules. Third
International Conference on Information and Knowledge Management, 1994.

5. R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association
rules. In Proceedings of the 22nd VLDB Conference, Bombay, India, 1996.

6. R. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. In Proceedings of the ACM-SIGMOD
98, Seattle, Washington, USA, June 1998.

7. G. Psaila. Integrating Data Mining Techniques and Relational Databases. Ph.D.
Thesis, Politecnico di Torino, 1998.




The Item-Set Tree: A Data Structure for Data Mining* 

Alaaeldin Hafez*, Jitender Deogun^, and Vijay V. Raghavan* 



Abstract. Enhancements in data capturing technology have led to exponential
growth in the amount of data being stored in information systems. This growth in
turn has motivated researchers to seek new techniques for the extraction of
knowledge implicit or hidden in the data. In this paper, we motivate the need
for an incremental data mining approach based on a data structure called the
item-set tree. The proposed approach is shown to be effective for solving problems
related to the efficiency of handling data updates, accuracy of data mining results,
processing input transactions, and answering user queries. We present efficient
algorithms to insert transactions into the item-set tree and to count frequencies
of itemsets for queries about the strength of association among items. We prove
that the expected complexity of inserting a transaction is ≈ O(1), and that of
frequency counting is O(n), where n is the cardinality of the domain of items.



1 Introduction 

Association mining, which discovers dependencies among values of an attribute, was
introduced by Agrawal et al. [1] and has emerged as a prominent research area. The
association mining problem, also referred to as the market basket problem, can be
formally defined as follows. Let I = {i_1, i_2, ..., i_n} be a set of items and
S = {s_1, s_2, ..., s_m} be a set of transactions, where each transaction s_i ∈ S is a
set of items, that is, s_i ⊆ I. An association rule, denoted by X ⇒ Y, where
X, Y ⊂ I and X ∩ Y = ∅, describes the existence of a relationship between the two
itemsets X and Y.

Several measures have been introduced to define the strength of the relationship 
between itemsets X and Y such as support, confidence, and interest. The definitions of 
these measures, from a probabilistic model are given below. 

I. Support(X ⇒ Y) = P(X, Y), or the percentage of transactions in the
database that contain both X and Y.

II. Confidence(X ⇒ Y) = P(X, Y) / P(X), or the percentage of transactions
containing Y among the transactions that contain X.

III. Interest(X ⇒ Y) = P(X, Y) / (P(X) P(Y)) represents a test of statistical
independence.
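
As a small illustration of the three measures, the following Python sketch estimates them by counting over a list of transactions. The function name and the sample data are made up for this example, and the sketch assumes that X and Y each occur in at least one transaction so that no division by zero arises.

```python
def measures(transactions, x, y):
    """Estimate support, confidence and interest of the rule X => Y.

    transactions: list of sets of items; x, y: disjoint collections of items.
    Probabilities are estimated as relative frequencies over the list.
    """
    n = len(transactions)
    x, y = set(x), set(y)
    p_x = sum(x <= t for t in transactions) / n          # P(X)
    p_y = sum(y <= t for t in transactions) / n          # P(Y)
    p_xy = sum((x | y) <= t for t in transactions) / n   # P(X, Y)
    return {
        "support": p_xy,
        "confidence": p_xy / p_x,
        "interest": p_xy / (p_x * p_y),
    }


# Made-up sample data: four transactions over the items 1..4.
sample = [{1, 2, 3, 4}, {1, 2}, {1, 3}, {2, 3}]
result = measures(sample, {1}, {2})
```

An interest value of 1 corresponds to X and Y being statistically independent; values above or below 1 indicate positive or negative correlation.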



This research was supported in part by the U.S. Department of Energy, Grant No. DE-FG02- 
97ER1220, and by the Army Research Office, Grant No. DAAH04-96- 1-0325, under 
DEPSCoR program of Advanced Research Projects Agency, Department of Defense. 

’ ahafez(raghavan)@cacs.usl.edu. The Center for Advanced Computer Studies, University of 
SW Louisiana, Lafayette, LA 70504, USA. 

^ Deogun@cse.unl.edu, The Department of Computer Science, University of Nebraska, 
Lincoln, NE 68588, USA. 

Mukesh Mohania and A Min Tjoa (Eds.): DaWaK’99, LNCS 1676, pp. 183-192, 1999 
© Springer- Verlag Berlin Heidelberg 1999 






Many algorithms [1, 2, 3, 4, 5, 6, 7, 8] have been proposed to generate association
rules that satisfy certain measures. A close examination of those algorithms reveals
that the spectrum of techniques that generate association rules has two extremes:

• A transaction data file is repeatedly scanned to generate large itemsets. The 
scanning process stops when there are no more itemsets to be generated. 

• A transaction data file is scanned only once to build a complete transaction lattice. 
Each node on the lattice represents a possible large itemset. A count is attached to 
each node to reflect the frequency of itemsets represented by nodes. 

In the first case, since the transaction data file is traversed many times, the cost of
generating large itemsets is high. In the latter case, while the transaction data file is
traversed only once, the maximum number of nodes in the transaction lattice is 2^n,
where n is the cardinality of I, the set of items. Maintaining such a structure is
expensive.

Many knowledge discovery applications, such as on-line services and the World Wide
Web, require accurate mining information from data that changes on a regular basis. On
the World Wide Web, hundreds of remote sites are created and removed every day. In
such an environment, frequent or occasional updates may change the status of some
rules discovered earlier. Also, many data mining applications deal with itemsets that
may not satisfy data mining rules. Users could be interested in finding correlations
between itemsets that do not necessarily satisfy the measures of the data mining rules.

Discovering knowledge is an expensive operation. It requires extensive access to
secondary storage, which can become a bottleneck for efficient processing. Running data
mining algorithms from scratch each time there is a change in the data is obviously not
an efficient strategy. Building a structure to maintain the discovered knowledge could
solve many problems that have faced data mining techniques for years, namely
database updates, accuracy of data mining results, performance, and ad hoc queries.

In this paper, we propose a new approach that represents a compromise between
the two extremes of the association mining spectrum. In the context of the proposed
approach, two algorithms are introduced. The first algorithm builds an item-set tree,
which is used to produce mining rules, by traversing the data file once. The second
algorithm allows users to apply on-line ad hoc queries on the item-set tree.

The item-set tree approach is introduced in section 2. In section 3, the counting of
itemset frequencies is described. The item-set tree approach is evaluated and the paper
is concluded in section 4.

2 The Item-Set Tree

The item-set tree T is a graphical representation of the transaction data file F. Each
node s ∈ T represents a transaction group s. All transactions that have the same
itemset belong to the same transaction group. Let I = {i_1, i_2, ..., i_n} be an ordered
set of items. For two transactions s_i = {a_1, a_2, ..., a_l} and s_j = {b_1, b_2, ..., b_k},
let s_i ≤ s_j iff a_p ≤ b_p for all 1 ≤ p ≤ min(l, k). We call l and k the lengths of s_i
and s_j, respectively.

Each node in tree T represents either an encountered transaction, i.e., a
transaction in the transaction file, or a subset of an encountered transaction. Node s_i
is an ancestor of node s_j if s_i ⊂ᵒ s_j (s_i is an ordered subset of s_j), that is,
s_i = {a_1, a_2, ..., a_l} and s_j = {a_1, a_2, ..., a_k} for some l < k. Moreover, a node
s_i is a direct ancestor of node s_j if s_i is an ancestor of s_j and there is no other
node s_k such that s_i ⊂ᵒ s_k ⊂ᵒ s_j. The frequency of a node s is denoted by f(s),
representing the count of transactions that have the same transaction group s. The
item-set tree is constructed by a transaction insertion process. The root node r
represents the null itemset {}. A transaction s is inserted by examining (in order) the
children of the root node r. Each time a transaction is inserted, f(r) is incremented
by 1. The insertion process ends successfully with one of the following cases.

Case 1: All nodes s_j (children of r) share no leading elements with s. A leaf node s
is inserted as a son of r, and f(s) is initialized to 1.

Case 2: s = s_j, the node already exists. f(s_j) is incremented by 1.

Case 3: s ⊂ᵒ s_j, s is an ordered subset of node s_j. A node representing s is inserted
as a child of r and as a parent of s_j. f(s) = f(s_j) + 1.

Case 4: s_j ⊂ᵒ s, node s_j is an ordered subset of s. The subtree that has s_j as its
root is examined and the procedure starts over again.

Case 5: s ∩ᵒ s_j ≠ ∅, there exists an ordered intersection between s and s_j. Two nodes
are inserted. A node s_l, s_l = s ∩ᵒ s_j, is inserted between r and s_j, and a node s is
inserted as a child of s_l. f(s_l) = f(s_j) + 1, and f(s) is initialized to 1.

Algorithm Construct(s, T)
  s is an input itemset
  T is the item-set tree
begin
  r = root(T)
  increase f(r)
  if s = items(r) then exit
  choose Ts = subtree(r) such that s and items(root(Ts)) are comparable
  if Ts does not exist then
    create a new son x for r, items(x) = s and f(x) = 1
  else if root(Ts) ⊂ᵒ s then call Construct(s, Ts)
  else if s ⊂ᵒ root(Ts) then
    create a new node x, as a son of r and a father of root(Ts),
    items(x) = s and f(x) = f(root(Ts)) + 1
  else create two nodes x and y, x as the father of root(Ts), s.t. items(x) = s ∩ᵒ
    root(Ts), f(x) = f(root(Ts)) + 1, and y as a son of x, s.t. items(y) = s,
    f(y) = 1
end

Figure 1: Algorithm Construct 
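
The insertion procedure can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes that itemsets are stored as sorted tuples, that "ordered subset" means "prefix", and that "ordered intersection" means "longest common prefix", which is how the definitions in this section read. The names Node, common_prefix and insert are chosen for this example only.

```python
class Node:
    """A node of the item-set tree: an itemset, its frequency, its children."""

    def __init__(self, items, count=0):
        self.items = items      # sorted tuple of items
        self.count = count      # f(node)
        self.children = []


def common_prefix(a, b):
    """Longest common prefix of two tuples (the ordered intersection)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def insert(node, s):
    """Insert the sorted tuple s into the subtree rooted at node."""
    node.count += 1
    if s == node.items:                       # Case 2: node already exists
        return
    for i, child in enumerate(node.children):
        shared = common_prefix(s, child.items)
        if len(shared) <= len(node.items):
            continue                          # nothing shared beyond the parent
        if shared == child.items:             # Case 4 (and Case 2 one level down)
            insert(child, s)
        elif shared == s:                     # Case 3: s becomes parent of child
            new = Node(s, child.count + 1)
            new.children.append(child)
            node.children[i] = new
        else:                                 # Case 5: split on the intersection
            mid = Node(shared, child.count + 1)
            mid.children = [child, Node(s, 1)]
            node.children[i] = mid
        return
    node.children.append(Node(s, 1))          # Case 1: new leaf under node


root = Node(())
for transaction in [(1, 2, 3, 4), (1, 2), (1, 3), (2, 3)]:
    insert(root, transaction)
```

Because every node stores the number of inserted transactions that pass through it, a node's frequency already includes the frequencies of all of its descendants.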

Example 1: Let I = {1,2,3,4} and F = ({1,2,3,4}, {1,2}, {1,3}, {2,3}) be a transaction
file that has 4 transactions. In this example, we assume that each transaction in the
transaction file F has occurred only once. The item-set tree T is fully constructed in
4 steps (one for each of the 4 transactions). The steps of the solution are shown in
Figure 2.

Inserting all transactions of the transaction data file F, using algorithm
Construct(s,T), requires scanning file F only once. An important characteristic of the
Construct(s,T) algorithm is that, no matter what the sequence of the inserted
transactions is, the item-set tree T is always the same.

In sections 4.1 and 4.2, we study the performance of algorithm Construct(s,T).
Step 1: s = {1,2,3,4}; s is added as a child of {} (case 1).

Step 2: s = {1,2}; s is added as a child of {} and as a father of {1,2,3,4} (case 3).

Step 3: s = {1,3}; s_l = {1} (s_l = {1,2} ∩ᵒ {1,3}) is added as a child of {} and as a
father of {1,2}; s is added as a child of s_l (case 5).

Step 4: s = {2,3}; s is added as a child of {} (case 1).

Figure 2: Steps 1 to 4 of Example 1.

3 Frequency Counting 

In order to answer ad hoc queries, we introduce algorithm Count. Algorithm Count
calculates the frequency of an itemset s by adding up the frequencies of those
encountered itemsets that contain s. In Example 2, we demonstrate how to count
frequencies of itemsets. Algorithm Count is given in Figure 3.

Algorithm Count(s, T)
  Input: an itemset s, and an item-set tree T.
  Output: frequency f of itemset s.
begin
  r = root(T)
  if s ⊆ r then f(s) = f(s) + f(r); end
  while r < s and last-item(r) < last-item(s) do
    traverse subtrees T' of r
      call Count(s, T')
  enddo
end

Figure 3: Algorithm Count 
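
A compact Python sketch of the counting idea is given below. It is a simplification, not a transcription of Figure 3: nodes are written as hypothetical (items, frequency, children) tuples, itemsets are sorted tuples, the query is assumed to be non-empty, and the ordering test r < s of the pseudocode is omitted, so only the containment test and the last-item test are used for pruning. The hand-built tree is the one that the insertion cases of section 2 yield for the four transactions of Example 1.

```python
def count(node, s):
    """Frequency of the non-empty sorted tuple s in the tree rooted at node.

    node is a tuple (items, frequency, children).
    """
    items, freq, children = node
    if set(s) <= set(items):
        # This node contains s, and so do all of its descendants, whose
        # transactions are already included in this node's frequency.
        return freq
    if items and items[-1] >= s[-1]:
        # Descendants only add items larger than the last item of this node,
        # so the items of s that are missing here can never appear below.
        return 0
    return sum(count(child, s) for child in children)


# Item-set tree for F = ({1,2,3,4}, {1,2}, {1,3}, {2,3}).
TREE = ((), 4, [
    ((1,), 3, [
        ((1, 2), 2, [((1, 2, 3, 4), 1, [])]),
        ((1, 3), 1, []),
    ]),
    ((2, 3), 1, []),
])

frequency_23 = count(TREE, (2, 3))
```

The visit order for the query (2, 3) is {1}, {1,2}, {1,2,3,4}, {1,3}, {2,3}, which matches the traversal described in Example 2.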

Example 2: Let T be the item-set tree constructed in Example 1, and s = {2,3} be the
itemset to be counted. To count the frequency of itemset s, the item-set tree T is
traversed in order as shown in the following steps.

• Start from the smallest subtree, with root node {1}. In this case, s > {1} and s ⊄ {1}.

• The subtree of {1} is traversed in order, starting with node {1,2}. s > {1,2} and
s ⊄ {1,2}.

• The subtree of {1,2} is traversed in order, starting with node {1,2,3,4}.
s ⊂ {1,2,3,4}, so f = 1.

• Go back to the next subtree of {1}, node {1,3}. s ⊄ {1,3}, and the last element in
{1,3} equals the last element in s. No further traversing through this subtree.

• Go back to the next subtree of {}, node {2,3}. s equals {2,3}. f = 1 + 1, and no
further traversing through this subtree. The procedure ends with f({2,3}) = 2.

4 Performance Results 

In this section, we study the performance of algorithms Construct and Count. We
assume that items are uniformly distributed over all transactions. In section 4.1, we
give the expected number of nodes in the item-set tree T after inserting N
transactions. In sections 4.2 and 4.3, we give the expected number of iterations to
insert a transaction and the expected number of iterations to count the frequency of an
itemset, respectively. In section 4.4, we discuss the results of our analytical
study.



4.1 Number of Nodes in the Item-Set Tree 

Lemma 1. Given an ordered set I = {i_1, i_2, ..., i_n} of n items, and a set of
transaction nodes V_k ∈ T, 1 ≤ k ≤ K, 1 ≤ K ≤ 2^n − 1, V_k = {a_1, a_2, ..., a_l},
a_1 < a_2 < ... < a_l, with items a_i ∈ I, 1 ≤ i ≤ l, 1 ≤ l ≤ n, which are uniformly
distributed over itemset domain I, and an itemset S_j = {b_1, b_2, ..., b_r} with items
b_1 < b_2 < ... < b_r, b_i ∈ I, 1 ≤ i ≤ r, 1 ≤ r ≤ n, which are uniformly distributed
over itemset domain I, Algorithm Construct produces an item-set tree T with expected
number of nodes K such that

where N is the number of inserted transactions.

Proof. Before proving lemma 1, we first state and prove the following lemma. The 
following lemma makes the proof easier to describe. 



Lemma 2. Given an ordered set I = {i_1, i_2, ..., i_n} of n items, and a set of
transaction nodes V_k ∈ T, 1 ≤ k ≤ K, 1 ≤ K ≤ 2^n − 1, V_k = {a_1, a_2, ..., a_l},
a_1 < a_2 < ... < a_l, where items a_i ∈ I, 1 ≤ i ≤ l, 1 ≤ l ≤ n, are uniformly
distributed over itemset domain I. Let S_j = {b_1, b_2, ..., b_r} be an itemset with
items b_1 < b_2 < ... < b_r, b_i ∈ I, 1 ≤ i ≤ r, 1 ≤ r ≤ n, which are uniformly
distributed over items domain I. Given that S_j is not an empty itemset, the
probability that there exists a node V_k ∈ T such that the ordered intersection of S_j
and V_k equals an itemset Z, where Z ≠ ∅, Z ≠ S_j, and Z ≠ V_k, is



P{S,rfV,=Z, Z^V„Z^T) = ( 



1 1 / 1 \2n— 2 

3-(n-l)4) 






Proof. First we state the assumptions:

• A transaction group (node) V_k is in T with probability
P(V_k ∈ T) = K / (2^n − 1), where K is the number of nodes in T.

• A transaction S_j and a transaction group V_k are each represented as a string of 1's
and 0's, where 0 in position i means item a_i ∈ I does not exist, and 1 in position i
means item a_i ∈ I does exist.

• Both V_k and S_j are not empty itemsets, i.e., the probability is conditioned. The
probability that both V_k and S_j are not empty itemsets is

P(V_k ≠ ∅ and S_j ≠ ∅) = P(V_k ≠ ∅) P(S_j ≠ ∅) = (1 − (1/2)^n)^2

• The item-set tree T already has K nodes, and each node represents either a
transaction group or an ordered intersection of two transaction groups.

• All K nodes in T are distinct, i.e., V_k ≠ V_l for all nodes k ≠ l in T.

We use the following table to demonstrate all the requirements needed.



       Shared items x                              OR
S_j    At least 1        1    0's or 1's                 0    At least 1
V_k    At least 1        0    At least 1                 1    0's or 1's



The following formula gives the required probability, 

£ ((t * T + T * t) - (i * i) * (i * i + i * i) * (1 - (i) ) 

x—2 

which could be written as 

Since we assume that both V_k and S_j are not empty itemsets, the above formula should
be divided by the probability that both V_k and S_j are not empty itemsets. Also, it
should be multiplied by P(V_k ∈ T) = K / (2^n − 1) and P(Z ∉ T) = 1 − K / (2^n − 1).
Now the complete formula could be written as

/I 

Proof of Lemma 1. We use the same assumptions given in the proof of Lemma 2. For
each newly encountered transaction group, algorithm Construct inserts either 1 node or
2 nodes. So, the cost function should equal

$$\sum_{S_j = S_1}^{S_N} \bigl[\, 1 + P(S_j \text{ inserts 2 nodes}) \,\bigr]$$

To insert two nodes in T, the following conditions must be satisfied: there exists a
node V_k ∈ T such that S_j ∩ᵒ V_k = Z, Z ≠ S_j, V_k ∈ T, and Z ∉ T.

By using Lemma 2,

P(S_j inserts 2 nodes) = P(S_j ∩ᵒ V_k = Z and Z ≠ ∅, Z ≠ S_j, Z ≠ V_k, Z ∉ T)

Or 



Expected number of nodes - ^ [1 + ( 









In the above formula, the following inequality is always true for n > 1:

$$\bigl(1 - (\tfrac{1}{2})^{n}\bigr)^{2} > \tfrac{1}{4}$$

Also, since K can take any value between 1 and 2^n − 1, the following inequality
always holds:

(I < J- 

V 2”— ''2”— ^ 

Using the above two inequalities, the upper bound of the expected number of nodes K 
in an item-set tree with N transactions is 

/^ < A(1 + f ((n - l)(i)" + 

4.2 Number of Iterations to Insert a Transaction 

Lemma 3: Given an ordered set I = {i_1, i_2, ..., i_n} of n items, and a set of
transaction nodes V_k ∈ T, 1 ≤ k ≤ K, 1 ≤ K ≤ 2^n − 1, V_k = {a_1, a_2, ..., a_l},
a_1 < a_2 < ... < a_l, where items a_i ∈ I, 1 ≤ i ≤ l, 1 ≤ l ≤ n, are uniformly
distributed over itemset domain I. Let S_j = {b_1, b_2, ..., b_r} be an itemset with
items b_1 < b_2 < ... < b_r, b_i ∈ I, 1 ≤ i ≤ r, 1 ≤ r ≤ n, which are uniformly
distributed over items domain I. Given that all V_k ∈ T and S_j are not empty
itemsets, the expected number of iterations algorithm Construct takes to enter a
transaction into the item-set tree T is less than

1 + n{{n - 2 ) * 2 "“' + 1 ) * 

where K is the number of nodes in T. 

Proof. In order to insert a transaction S_j with length l in exactly one iteration,
i.e., at the first level of the item-set tree T, there are two cases. In the first
case, there exists a node V_k ∈ T in the first level of T such that V_k = S_j. In the
second case, neither S_j nor any ordered subset node of S_j is in T; in other words,
V_k ∉ T for all V_k ⊂ᵒ S_j. Let P_φ = P(V_k ∉ T, V_k ⊂ᵒ S_j) and
P_e = P(V_k ∈ T, S_j = V_k). The cost of inserting such a transaction is less than

Now, to insert a transaction S_j with length l in exactly two iterations, i.e., at the
second level of the item-set tree T, there are two cases. In the first case, exactly
one ordered subset of S_j exists in T, and there exists a node V_k ∈ T in the second
level of T such that V_k = S_j. In the second case, there exists exactly one ordered
subset of S_j in T, and neither S_j nor any other ordered subset node of S_j is in T.
Let P_s = P(V_k ∈ T, V_k ⊂ᵒ S_j). The cost of inserting such a transaction is less than

$$2 \cdot \bigl( C^{l}_{1} \, ( P_\varphi P_s + P_e P_s ) \bigr)$$

Since the maximum number of iterations is l, the expected cost of inserting
transaction S_j is

$$\sum_{i=1}^{l} ( P_\varphi + P_e ) \cdot i \cdot \bigl( C^{l}_{i-1} \, P_s^{\,i-1} \bigr) \qquad (1)$$
By following the same assumptions given in the proof of Lemma 2, the

expected value of ^ is f ^2 ) f ) 

and the expected value of (P_φ + P_e) is 1 − P_s. Formula (1) could be written as

$$(1 - P_s) \cdot \bigl( P_s \cdot l \cdot (1 + P_s)^{l-1} + (1 + P_s)^{l} \bigr) \qquad (2)$$

Since (1 + P_s)^l = 1 + l P_s + l(l−1) P_s^2 + ...,

Formula (2) could be written as

$$(1 - P_s) \cdot \bigl( l P_s + l(l-1) P_s^{2} + l(l-1)(l-2) P_s^{3} + \dots + 1 + l P_s + l(l-1) P_s^{2} + l(l-1)(l-2) P_s^{3} + \dots \bigr)$$

By ignoring higher-order terms, the above formula could be written as

$$1 + (2l - 1) \, P_s$$

Since l can take any value between 1 and n, the expected number of



l=n 



iterations is ^^(l+(2/— 1)*(- 



i=i 






Which could be written as l+n((n — 2)*2" +1)* 3 

4.3 Number of Iterations to Count the Frequency of an Itemset 

Lemma 4. Given an ordered set I = {i_1, i_2, ..., i_n} of n items, and a set of
transaction nodes V_k ∈ T, 1 ≤ k ≤ K, 1 ≤ K ≤ 2^n − 1, V_k = {a_1, a_2, ..., a_l},
a_1 < a_2 < ... < a_l, where items a_i ∈ I, 1 ≤ i ≤ l, 1 ≤ l ≤ n, are uniformly
distributed over itemset domain I. Let S_j = {b_1, b_2, ..., b_r} be an itemset with
items b_1 < b_2 < ... < b_r, b_i ∈ I, 1 ≤ i ≤ r, 1 ≤ r ≤ n, which are uniformly
distributed over items domain I. Given that all V_k ∈ T and S_j are not empty
itemsets, the expected number of iterations algorithm Count takes to count an itemset
frequency in the item-set tree T with K nodes is



4 r _i_7n/K«+1 ^K2n-2 «(«-!). U2n-1 

+ fl'i * / y-("+y)( 2 ) -(2) ^^2) 

2 y (!-({)" )" 



)*(*) 



Proof. In order to count the frequency of an itemset sy with length I, where Oi and Oi 
are orders of first element and last elements in Sy ii e i,- , respectively, all itemsets 
S J with first element has order Oi , and last element has order Ok , which could 
have Sj as part of them should be checked. The number of such checks (or iterations) 



is 



I o, -O, 



. The count stops when we reach the full set of s -, , we will call it S f ■ 

■' J 



So, to count the frequency of itemset Sj in exactly one iteration, there should be a 
node Vk^ T such that S , or, with unsuccessful count, when the first visited 

node Vk^T such that S^- 'Z'' and ■ Let = P(Vj, e T,V^ Sj ) and 

Po = P(Vj S’’) ■ The cost of counting such transaction is l*(p,.+ P 3 ,) 
Generally speaking, to count the frequency of an itemset S_j with length l, where O_1
and O_l are the orders of the first and last elements in S_j, i_1, i_l ∈ S_j,
respectively, in exactly i iterations, the cost of counting is

i * Pe Ps + Ps PO ) 

Since the maximum number of iterations is 2^' , the expected cost of counting 

frequency of itemset is 

(p « + p * * (c^T°' p s ) (1) 



The expected value of Ps is 



(u-2)(|)” + i+(|)2" 



)*(^) 



since, P<b + Pe =1 — Pi, formula (1) could be written as follows 



2O1-O1 j_j 

( 1 -p.ox iHctr'Ps ) 



which equals 



(l-PJ*(PX2'''^')(l+Psf “‘+(1+Pif ) 

Since, (1 + P )"' = 1 + xP + x(x — 1)P^ + 



P^(2^^)(l+Psf^~' +0-+Psf could be written 



, 1 ,2n-0/-i-Oi-l , 1 ,4n-20/-i-20i-l 

l + ( ^ ... )*(^) + ( ^ ... 



+20[-l 2 , I ,4n-0/+0[-l 2 

1 *1-^1 -fill ) + 

,")2 1 t (l-(i)")2 ’ '•2"-d 



Algorithm Count applies the search to all other itemsets that start with lower-order
items, i.e., items with order less than O_1, one at a time. The number of such
itemsets, including S_j, is O_1. By neglecting higher-order terms and summing over all
possible itemsets, our formula could be written as

/J_\2h-0/+0i-1 



0+0 (— 



1 -IV (l-(T)«)^ ^ ^2"-l^ 

Taking an average over Oi, which ranges from Oi to n, the above formula is 
converted to 



o +( ") * r 



(l)„+Ol-2-(q)2„-l 



)*(*) 



For simplicity, since the minimum value of O_1 is 1, we will divide the
second term by 1. Taking the average over O_1, which ranges from 1 to n, the above
formula will be



4 ^„j_ 7 x/Kh +1 /K2h- 2 «(n-l), 1 ,,2/1-1 
«zl + f i'l * / y-("+y)(2) -(2^ ^(2^ 

2 y (1-(|)")^ 



)*(*) 
4.4 Conclusions and Discussion

In this paper, we have introduced a new approach for association mining, called the
item-set tree approach. The new approach addresses some of the problems inherent in
traditional data mining techniques, such as data updates, accuracy of data mining
results, performance, and user queries. The spectrum of techniques that generate
association rules has been studied, and two extreme cases have been analyzed. The
main assumption in our study is that all items are equally likely to appear in an
itemset. Although this assumption does not reflect real life, it gives a good
indication of the performance of the item-set tree approach.

We have discussed the item-set tree approach in detail. In our approach, the
transaction file is read only once. The item-set tree approach maintains a structure
for frequency counting of transaction data that allows future updates. Two
algorithms are investigated: the first inserts transactions into the item-set tree,
and the second counts frequencies of itemsets. Our investigations of the two algorithms
show that the costs of insertion and counting do not depend on the number of
transactions. The expected cost of inserting a transaction is ≈ O(1), and the expected
cost of counting the frequency of an itemset is O(n), where n is the cardinality of the
domain of items. We conclude that those items that are queried most by users should
have low order values, while those items that are rarely queried by users should have
high order values. This can be accomplished by using prior knowledge of the pattern
of user queries.
Abstract. The discovery of the most recurrent association rules in a
large database of sales transactions requires that the sets of items bought
together by a sufficiently large population of customers be identified.
This is a critical task, since the number of generated itemsets grows
exponentially with the total number of items. Most algorithms start by
identifying the sets with the lowest cardinality and then progressively
increase it. Our approach is different, since the sets to be
considered at a time are determined by the items in the sets. The main
advantage is a significant reduction of the CPU time required to update
data structures in main memory. This paper presents an algorithm that
requires only one pass over the database, scales up linearly
with the dimensions of the database and, as shown by the experiments,
performs better than other classical algorithms.



1 Introduction 

Association rules are a powerful and intuitive conceptual tool to represent the 
phenomena that are recurrent in data. The discovery of association rules has 
several applications in the analysis of business data, such as the basket data 
of supermarkets, failures in telecommunications networks, medical test results, 
health insurance, and many others. 

In a database of transactions, an association rule X ⇒ Y associates two sets
of data (also called sets of items) which are found together in a transaction. Its
utility is to show which kinds of items are frequently correlated in customers'
purchases. The statistical frequency of an association rule (or more generally
of a set of items) is called support and is the percentage of transactions of the
database in which all the items in the association rule are present.

The number of association rules that may be extracted from a very large
database grows exponentially with the number of items. To make the problem
feasible, the analyst must provide a threshold for the support, so that
only the most frequent association rules are discovered. Nevertheless, even
when the problem is stated like this, it may still be critical. The most important
step of the algorithms that extract association rules is to identify all the sets of
items S whose support is higher than the threshold (called large itemsets). The
computation of the actual support of a set of items requires that, while reading the
database, a counter be allocated in main memory to keep track of the number of
transactions that contain the set. If the number of examined sets is not kept
low enough while the database is being read, the total number of
counters may be too large to fit in main memory, or too much effort is wasted
keeping the support of sets that eventually turn out to be lower than the threshold.

The algorithms that have already been proposed [1, 2, 3, 4, 5, 6] solve this 
problem iteratively. They keep only a subset of the collection of sets in main 
memory at a time. In particular, in each iteration, the cardinality of the sets 
whose support is being computed is fixed. After the support of each of them is 
known, the pruning phase is executed, to get rid of those sets whose support 
is lower than the threshold. In the next iteration the support of the sets with 
increased cardinality is determined. These sets (called the candidate sets) are 
identified from the large itemsets found in the previous iteration. These algo- 
rithms execute the pruning phase once for each iteration, and perform as many 
iterations as the cardinality of the longest itemsets with sufficient support. No- 
tice, also, that some of these algorithms [1, 3, 4, 6] perform a reading pass on 
the database for each iteration. This reading pass determines the number of I/O 
operations performed in each iteration which are the most expensive from the 
viewpoint of the execution time. 
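The iterative scheme described above can be sketched as follows. This is a simplified levelwise procedure in the spirit of the cited algorithms, not a faithful reproduction of any of them; the function name, the use of an absolute count `min_count` instead of a percentage, and the toy database are assumptions.

```python
from itertools import combinations


def levelwise_large_itemsets(transactions, min_count):
    """Return ({itemset: count}, number of reading passes on the database)."""
    db = [frozenset(t) for t in transactions]
    candidates = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    large, passes, k = {}, 0, 1
    while candidates:
        passes += 1  # one reading pass per iteration (fixed cardinality k)
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Pruning phase: drop the sets whose count is below the threshold.
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        # Candidate sets of cardinality k + 1, built from the large k-itemsets.
        nxt = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level
                                       for s in combinations(u, k)):
                nxt.add(u)
        candidates, k = list(nxt), k + 1
    return large, passes


large, passes = levelwise_large_itemsets(["ABD", "ABCD", "BC", "AD"], min_count=2)
print(passes)  # 3: as many passes as the cardinality of the longest large itemset
```

In this toy run the longest large itemset is {A, B, D}, so the database is read three times, which is the cost that a single-pass approach tries to avoid.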

In this paper we propose a new approach for the identification of all the large 
itemsets. This approach is based on the observation that the collection of the 
sets that is maintained in the main memory in each iteration can be arbitrarily 
chosen. The itemsets are ordered lexicographically. The first iteration keeps the 
itemsets that start with the last item in the lexicographical order, while the 
subsequent iterations keep the itemsets that start with the other items, in the 
decreasing order. The purpose is to improve the efficiency in the generation of 
the candidate itemsets and in the reduction of the number of accesses to main 
memory required to update their support. 

A new algorithm based on this approach is proposed. The algorithm is called
Seq because in its first step, instead of building itemsets, it builds
sequences of items. Seq reduces I/O execution times because it makes a single
reading pass on the database. Moreover, we will show with our experiments that
Seq is specifically oriented to databases of very large dimensions and searches of
very high resolution, where the minimum support is defined at very low levels.
Seq also reduces CPU execution times because it requires only two accesses to
main memory in the generation of a candidate itemset and when updating its
support counter, regardless of the itemset length; moreover, it executes the
pruning phase once for each item, instead of once for each value of itemset cardinality
(thousands of times more often in real databases).

The paper is organized as follows. Section 2 introduces some preliminary
definitions and presents the algorithm. Section 3 provides an evaluation of its
properties. Finally, Section 4 shows the results of some experiments, while
Section 5 draws the conclusions.

2 Algorithm Seq

2.1 Preliminary definitions 

1. The database is organized in transactions. Each transaction is represented 
with the items lexicographically ordered. We indicate with T [i] the item of 
transaction T in the position i (i starts from 0). 

2. We call the set of all the possible items in the database the Alphabet. We con- 
sider it lexicographically ordered and indicate with < the ordering operator 
and with > the operator of opposite ordering. 

3. Given a transaction T of length L, the sequence of all items extracted from
T starting at position j is denoted as $\mathit{Seq}_j^T$ ($0 \le j < L$) and defined as follows:

$\mathit{Seq}_j^T = (T[j]\; T[j+1]\; T[j+2] \cdots T[L-1])$

T[j], the starting item in the sequence $\mathit{Seq}_j^T$, is called the leader of the
sequence, whereas T[L-1], the last one, is called the terminal item.

The number of sequences in a transaction T is equal to the length of T. For
example, the transaction T = (ABCD) has four sequences:

$\mathit{Seq}_0^T$ = (ABCD), $\mathit{Seq}_1^T$ = (BCD), $\mathit{Seq}_2^T$ = (CD) and $\mathit{Seq}_3^T$ = (D).

4. An ordered set, and thus also a sequence $S = \mathit{Seq}_j^T$, is stored in the main
memory in a tree. The first item in the sequence (Leader{S} = T[j]) is saved
on the root node of the tree; the second one (T[j+1]) on a son node of the
first one, and so on. For example, the sequence (ABCD) would be saved on
a tree with A on the root node, B on a son node of A and so on. A given tree
is used to store all the sequences having the same starting item. Figure 1
shows the tree with the two sequences (ABCD) and (ABD). The tree in the
above example is denoted as $\mathcal{T}_A^{<}$ since A is the item in the root node and
< is the operator of ordering of the items. Vice versa, we denote with $\mathcal{T}_A^{>}$ a
tree in which > is used as operator of ordering to store the items in the tree.

Fig. 1. The tree $\mathcal{T}_A^{<}$ with the two sequences (ABCD) and (ABD).

5. A counter is associated to the terminal node of each sequence. It keeps the
number of transactions in which the sequence (composed of the items stored
from the root node to the terminal node) occurs. Observe also that the
counter of a sequence is not necessarily stored in a leaf node of the tree: this
is the case of the sequence (ABC), a prefix of the sequence (ABCD).
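Definitions 3 to 5 can be sketched in Python as follows. The names `sequences`, `Node` and `insert`, and the choice of a dictionary of children per node, are assumptions of this illustration and are not taken from the authors' implementation.

```python
def sequences(transaction):
    """All suffix sequences of a transaction, with items in lexicographic order."""
    items = tuple(sorted(transaction))
    return [items[j:] for j in range(len(items))]


class Node:
    """Tree node: an item, a counter, and the son nodes indexed by item."""

    def __init__(self, item):
        self.item = item
        self.count = 0      # meaningful only when a sequence terminates here
        self.children = {}


def insert(trees, seq):
    """Store `seq` in the tree rooted at its leader; bump the terminal counter."""
    node = trees.setdefault(seq[0], Node(seq[0]))
    for item in seq[1:]:
        node = node.children.setdefault(item, Node(item))
    node.count += 1
    return node


print(sequences("ABCD"))
# [('A', 'B', 'C', 'D'), ('B', 'C', 'D'), ('C', 'D'), ('D',)]

# The tree rooted at A then holds the two sequences (ABCD) and (ABD) of Figure 1.
trees = {}
for t in ["ABCD", "ABD"]:
    for seq in sequences(t):
        insert(trees, seq)
```

Note that the node A-B-C exists in the tree rooted at A but keeps a counter of 0, since no sequence terminates there in this toy input.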
2.2 Description of the Algorithm Seq

Algorithm Seq works in two steps. 



First Step. The database is read. For each transaction the algorithm finds all the
sequences and stores them in the main memory in trees. For example, the transaction
$T_1$ = (ABCD) generates four sequences (ABCD), (BCD), (CD) and (D) that
are stored in the four trees $\mathcal{T}_A^{<}$, $\mathcal{T}_B^{<}$, $\mathcal{T}_C^{<}$ and $\mathcal{T}_D^{<}$. If a sequence is found for the
first time, the counter associated to the terminal node is set to 1; otherwise, it
is incremented by 1. Figure 2 shows the trees for a very simple database.




Fig. 2. The trees after the first step is completed. 



At the end of the first step only the support of the sequences is known.
However, the support of the itemsets can be determined from the support of the
sequences, as stated by the following theorem, whose proof is omitted.

Theorem. The support of an itemset I, with item X as its first item, can be
obtained as the sum of the counters of all the sequences of $\mathcal{T}_X^{<}$ containing I. □
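The first step and the theorem can be checked on a toy database with the following sketch. For brevity it keeps the sequence counters in a flat `Counter` keyed by tuples instead of trees; that representation, the function names and the toy data are assumptions of this illustration. Counts are absolute numbers of transactions, and dividing by the database size gives the support.

```python
from collections import Counter


def first_step(transactions):
    """Single pass: count every suffix sequence of every (sorted) transaction."""
    counters = Counter()
    for t in transactions:
        items = tuple(sorted(set(t)))
        for j in range(len(items)):
            counters[items[j:]] += 1
    return counters


def itemset_count(itemset, counters):
    """Theorem: sum the counters of the sequences that are led by the first
    item of the itemset and that contain the whole itemset."""
    items = sorted(itemset)
    leader, wanted = items[0], set(items)
    return sum(c for seq, c in counters.items()
               if seq[0] == leader and wanted <= set(seq))


db = ["ABD", "ABCD", "BC", "AD"]
counters = first_step(db)
print(itemset_count("BD", counters))  # 2: sequences (BD) and (BCD), one each
print(itemset_count("AD", counters))  # 3
```

The argument behind the sum is that every transaction containing the itemset contributes exactly one sequence that starts at the itemset's first item, so no transaction is counted twice.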

Second Step. Its purpose is to determine the support of the itemsets
from the support of the sequences according to the previous theorem. The trees
generated in the first step ($\mathcal{T}^{<}$) are used in order to produce a second set of trees
($\mathcal{T}^{>}$) in which the itemsets with the relative support counters are represented.
Each $\mathcal{T}^{<}$ tree is read, starting from the tree of the last item in the Alphabet
and proceeding with the trees of the other items in decreasing order. The
sequences of each tree are taken with their counters, and from each of them
the subsets containing the Leader of the sequence are determined and stored in
a $\mathcal{T}^{>}$ tree (the other subsets, not containing the Leader, are determined while
reading the other trees). These subsets are the itemsets. The counter associated
to the terminal node of the itemset is incremented by the value of the counter
of the sequence originating it. In this way, at the end of the reading of a generic
tree $\mathcal{T}_X^{<}$, the counters of the itemsets originated from the sequences of that tree
contain the correct value necessary to determine their support. If the support of
an itemset is not sufficient, the itemset is deleted from its tree.

Sequence Subset Determination. In order to understand the technique used to
produce the itemsets from the sequences, consider the reading of the tree $\mathcal{T}_D^{<}$ of
Figure 2, and in particular the sequence (DYZ) (see Figure 3).



[Figure 3: (1) trees containing sequences, with the sequence <DYZ>, counter = 2; (2) trees containing itemsets.]

Fig. 3. The creation of the itemsets from the sequence (DYZ).



At this time, the trees $\mathcal{T}_Z^{<}$ and $\mathcal{T}_Y^{<}$ have already been read; in main memory
there are the itemsets originated from the sequences that start with Y and Z. The
generation of the itemsets of the sequence (DYZ) consists in the addition of the
item D, Leader of the sequence, to all the subsets of the remaining portion of the
sequence ((YZ)). These latter subsets are the empty set and {Z}, {Y}, {YZ}.
The empty set corresponds to the determination of the itemset {D}: the root
node of $\mathcal{T}_D^{>}$ is created only if the reading of the database in the first step has
proved that item D has a sufficient support. As regards the other subsets, if their
support is not sufficient, they are not found in the $\mathcal{T}_Z^{>}$ and $\mathcal{T}_Y^{>}$ trees respectively,
and no work is wasted considering their supersets. In the positive case, a leaf
node containing the Leader D is added to the subset in the appropriate $\mathcal{T}^{>}$ tree.
The algorithm that creates the sequence subsets is reported below.

procedure create_subsets (sequence S, counter c, list prune_list)
    X = Leader{S};
    list current_sets, previous_sets;
    for all items I ∈ S from last one to first one (X excluded) do
        if $\mathcal{T}_I^{>}$ exists then
            leaf = add_leaf($\mathcal{T}_I^{>}$.root_node, X, c, prune_list);
            add $\mathcal{T}_I^{>}$.root_node to current_sets;
        end if
        for all nodes P in previous_sets do
            if exists a child node N of P with item I then
                leaf = add_leaf(N, X, c, prune_list);
                add N to current_sets;
            end if
        end for
        swap current_sets into previous_sets;
    end for
end procedure

procedure add_leaf (node F, item X, counter c, list prune_list)
    if exists a child node N of F with item X then
        N.count = N.count + c;
    else
        allocate a new node N child of F with item X;
        N.count = c;
        add N to prune_list;
    end if
end procedure
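The two procedures above, together with both steps of Seq, can be turned into a small runnable Python sketch. Several choices here are assumptions and not the authors' implementation: sequences are kept in a flat `Counter` rather than in trees, the threshold is an absolute count `min_count`, the swap of `current_sets` into `previous_sets` is read as accumulating the nodes reached so far (so that subsets skipping an intermediate item are still found), and the helper names `seq_large_itemsets` and `itemsets` are hypothetical.

```python
from collections import Counter


class Node:
    def __init__(self, item, count=0):
        self.item, self.count, self.children = item, count, {}


def add_leaf(parent, x, c, prune_list):
    """Add c to the child of `parent` holding item x, creating it if needed."""
    child = parent.children.get(x)
    if child is None:
        child = parent.children[x] = Node(x)
        prune_list.append((parent, child))
    child.count += c


def create_subsets(seq, c, roots, prune_list):
    """Append the leader seq[0] to every surviving subset of seq[1:]."""
    x = seq[0]
    reached = []  # nodes of the itemsets built from the items visited so far
    for item in reversed(seq[1:]):
        current = []
        root = roots.get(item)
        if root is not None:
            add_leaf(root, x, c, prune_list)
            current.append(root)
        for p in reached:
            n = p.children.get(item)
            if n is not None:
                add_leaf(n, x, c, prune_list)
                current.append(n)
        reached.extend(current)  # assumption: accumulate rather than replace


def seq_large_itemsets(transactions, min_count):
    # First step: one reading pass, counting sequences and single items.
    seq_counts, item_counts = Counter(), Counter()
    for t in transactions:
        items = tuple(sorted(set(t)))
        item_counts.update(items)
        for j in range(len(items)):
            seq_counts[items[j:]] += 1
    # Second step: leaders are processed in decreasing order.
    roots = {}
    for x in sorted(item_counts, reverse=True):
        if item_counts[x] < min_count:
            continue
        prune_list = []
        for seq, c in seq_counts.items():
            if seq[0] == x:
                create_subsets(seq, c, roots, prune_list)
        for parent, child in prune_list:  # pruning phase, once per item
            if child.count < min_count:
                del parent.children[child.item]
        roots[x] = Node(x, item_counts[x])
    return roots


def itemsets(roots):
    """Flatten the itemset trees into {frozenset: count}."""
    out = {}

    def walk(node, prefix):
        items = prefix + (node.item,)
        out[frozenset(items)] = node.count
        for child in node.children.values():
            walk(child, items)

    for root in roots.values():
        walk(root, ())
    return out


result = itemsets(seq_large_itemsets(["ABD", "ABCD", "BC", "AD"], min_count=2))
```

In this sketch each itemset is stored along a path that starts from its largest item and ends with its smallest one, so adding a leader always means adding a single leaf below an existing node.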



2.3 Advantages of Sequences

Seq gains several benefits from the representation in terms of sequences:

1. During the reading of the database, when not enough information is known
to eliminate many itemsets, only the sequences are maintained in the main
memory: the support of the sequences is a sort of “summary” of the support
of the itemsets and saves many counters in main memory.

2. Seq saves CPU execution time, which is determined not
only by the total number of candidate itemsets kept in main memory but
also by the number of accesses to main memory that each of them requires.
Seq needs two accesses to main memory when it generates a new itemset
and when it updates its support (see the code in Section 2.2), independently
of the itemset length. On the contrary, for the generation of an itemset of
length k, Apriori requires k accesses. Then, Apriori checks for the presence
of k subsets of length (k-1), which requires k(k-1) accesses. Finally, when it
updates the support of an itemset of length k it performs k accesses.

3. Seq executes the pruning phase very frequently (once for each item with
sufficient support). Traditionally, instead, the pruning phase is run once for
each level of itemset cardinality, that is, about three orders of magnitude less often.
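As a worked instance of point 2, and under the assumption that the three Apriori costs listed there simply add up, the number of main memory accesses spent on one itemset of length k is

```latex
\[
A_{\mathrm{Apriori}}(k)
  = \underbrace{k}_{\text{generation}}
  + \underbrace{k(k-1)}_{\text{subset checks}}
  + \underbrace{k}_{\text{support update}}
  = k(k+1)
\]
```

The symbol $A_{\mathrm{Apriori}}$ is introduced here only for illustration. For k = 5 this gives 30 accesses, whereas the cost stated for Seq does not depend on k.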



2.4 Implementation of Seq 

When very large databases are used, the number of sequences represented in
the $\mathcal{T}^{<}$ trees might be too large to fit in the main memory. We therefore changed
the algorithm to allow buffer management. The algorithm swaps the content
of the trees to disk in the first step and reads it back in the second one. Each tree
is swapped to a separate file: $\mathcal{T}_A^{<}$ is swapped to one file, $\mathcal{T}_B^{<}$ to another, and so on.

The sequences are written to disk in a compressed form. For example, once
the first sequence of $\mathcal{T}_A^{<}$, (ABCD), is written, the second one, (ABDYZ),
can be represented by substituting the prefix items common to the two consecutive
sequences ((AB)) with the prefix length (2). Especially if the transactions have
many common sequences, this technique saves a great amount of information.
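The prefix substitution just described can be sketched as follows. The function names and the in-memory representation as (prefix length, remaining items) pairs are assumptions; the actual on-disk format is not specified in the text.

```python
def compress(sequences):
    """Replace the prefix shared with the previous sequence by its length."""
    out, prev = [], ()
    for seq in sequences:
        n = 0
        while n < min(len(prev), len(seq)) and prev[n] == seq[n]:
            n += 1
        out.append((n, tuple(seq[n:])))
        prev = tuple(seq)
    return out


def decompress(encoded):
    """Rebuild the sequences from (prefix length, remaining items) pairs."""
    out, prev = [], ()
    for n, tail in encoded:
        seq = prev[:n] + tail
        out.append(seq)
        prev = seq
    return out


encoded = compress(["ABCD", "ABDYZ"])
print(encoded)  # [(0, ('A', 'B', 'C', 'D')), (2, ('D', 'Y', 'Z'))]
```

The saving grows when consecutive sequences of the same tree are written in an order that keeps long common prefixes adjacent, for example a depth-first visit of the tree.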

3 Evaluation of Algorithm Seq 

Three basic parameters characterize a given data mining problem: the total 
number N of the transactions, the average length L of the transactions and the 
number I of different items. Let us analyze their influence on the computational 
work of the presented algorithm and the access times to mass storage. 

Computational work. The computational work involved in step 1, i.e. in the
construction of the trees containing sequences, is relatively small and is
proportional to the product L*N, that is, to the size of the database. The
computational work in step 2, that is, in the generation of the trees of the subsets
of items, grows exponentially with L and linearly with N. However, L is limited and
is characteristic of a specific application. Besides, the computational time needed to
construct the trees of the itemsets is relatively small with respect to the times
necessary to read the database and to store and retrieve the trees from disk.

The execution time of Seq is nearly independent of the value of the minimum 
support. This point is rather important. Indeed, the concept of minimum sup- 
port has been introduced to reduce the computational time, but it reduces the 
statistical significance of the search. Above all, if a certain value of the minimum 
support is reasonable for the itemsets of length equal to 1, it might be enormous 
for itemsets of length 2 or more. So, the new algorithm might be adopted in 
searches characterized by very small values of resolution. 

Access time to mass storage. Let $T_D$ be the time spent to read the
database, which is a lower bound for any algorithm. For Seq, $T_D$ must be
increased by the time $T_F$ spent to save and retrieve the trees. Thus the upper
bound of the total volume of data exchanged with mass storage amounts
to the size of the database plus two times the size of the files of sequences. For
very large databases this upper bound is only slightly larger than the database size.

Two contrasting factors influence $T_F$ with respect to $T_D$. On the one hand, the volume
of data in the trees would be larger than the whole database (because more than
one sequence corresponds to each transaction) for those databases in which
transactions have very few common items. This is the case of the synthetic databases
of the experiments of Section 4. On the other hand, repeated transactions require
the same amount of information as a single transaction: therefore $T_F < T_D$
when transactions are frequently repeated or have many common items. $T_F$ can
be considerably reduced if the information represented by the trees of sequences is
compressed. In our implementation we have adopted a very simple compression
technique, but it would be possible to reduce $T_F$ with more sophisticated
compression techniques even when no transactions are repeated.

4 Experiments

We have run our implementation of algorithm Seq on a Pentium II PC with
a 233 MHz clock, 128 MB RAM and Debian Linux as operating system. We
have worked on the same class of synthetic databases [1] that has been taken as
benchmark by most previous algorithms. Broadly, each transaction of these
databases has been generated by adding some randomly extracted items
to a large itemset. Because of this “semi-random” content, we believe that this
kind of database is not very suitable for an efficient execution of Seq: in most cases
the total size of the files containing sequences becomes comparable with the
database size, and the execution time spent in I/O is not mainly determined by
the database reading pass. The results of the experiments are shown in Figure 4.

In the left column of Figure 4 we compare the execution times of Seq with 
Apriori, one of the best algorithms. The three experiments analyzed refer to the 
three classes of databases characterized by increasing transaction length (5, 10, 
15). For each database class, we show the execution times for different databases 
in which the average length of the large itemsets gets the values 2, 4, 6, 8, 12. 
Seq works better than Apriori in all the experiments, but the best gains occur 
when the average itemsets length is comparable with the transaction length. 
Observe also how Seq execution time is almost constant with respect to the 
average itemsets length, whereas Apriori increases the execution times because 
of the increasing number of reading passes on the database. Notice that, as the 
value of the minimum support decreases, Seq gets much better. This behavior is 
due to the fact that a certain number of itemsets with higher cardinality values 
result with sufficient support. In these conditions, Apriori must increase the 
number of reading passes on the database whereas the number of I/O operations 
performed by Seq remains almost constant. Furthermore, even when the number 
of I/O operations performed by Seq is comparable with Apriori, the execution 
times of Seq are still lower. Therefore, we have compared the number of candidate 
itemsets in main memory in order to ascertain whether this one was a favorable 
factor to Seq that could determine a lower CPU processing time. We have noticed 
that there is not a significant difference between the two algorithms. So, we 
have concluded that the computational work performed by Seq is lower than 
Apriori because of the lower number of accesses to main memory required in 
the generation and update of candidate itemsets, as already stated in Section 2.3. 

The experiments on the scale-up properties with respect to the dimension
of the database are shown in the first experiment of the second column. In this
experiment the minimum support is fixed to 0.5, but analogous results are obtained
with lower values. We adopted databases with 100, 200, 300 and 400 thousand
transactions. The linear behavior is still verified by both Seq and Apriori, but
with very different slopes. These different increases are due again to the number
of I/O operations. Seq reads the database once, writes the files with the
sequences once and then reads a certain number of them. However, the size of the
files increases by only a small fraction of the database size as the latter increases.
On the contrary, the number of I/O operations performed by Apriori is still
determined by the repeated reading passes on the database.

Fig. 4. Experiment results with synthetic databases.

The next experiment shows the ratio between the total storage occupation of
the files containing the sequences and the database size, for different
databases having the same statistics but different dimensions. This experiment
confirms our previous evaluations. For very large databases
(2 million transactions) the size of the files containing the sequences is only
a small fraction (7%) of the database size. In these cases, the I/O operations on
the files do not influence the total time spent in performing I/O, because this
is mainly determined by the reading pass of the database.
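Combining this measurement with the upper bound stated in Section 3 (database size plus two times the size of the files of sequences) gives a rough worked estimate. The symbols $V$, $S_{DB}$ and $S_F$ are introduced here only for illustration, and the estimate assumes that the 7% ratio holds for the whole set of files.

```latex
\[
V \;\le\; S_{DB} + 2\,S_F \;=\; S_{DB} + 2 \cdot 0.07\, S_{DB} \;=\; 1.14\, S_{DB}
\]
```

Under these assumptions the total volume of data exchanged with mass storage exceeds a single reading pass of the database by at most 14%.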

The remaining experiment shows the variation of the execution times of Seq 
with the average transaction length that confirms that the execution time of Seq 
is exponential with respect to the length of the transaction. 

5 Conclusions and Future Work 

A new technique for the discovery of frequent itemsets has been presented. It
is specifically oriented to databases of very large dimensions and searches of 
very high resolution. The algorithm Seq based on this approach is characterized 
by an increased processing efficiency, since its execution times are better than 
Apriori, one of the best algorithms of current literature. Seq needs only one pass 
on the database and has a linear behavior, almost constant, with the dimension 
of the database. Moreover our experiments have shown that Seq execution time 
is nearly constant with respect to the maximum cardinality of the itemsets. 

Max-Miner [7] and Pincer-Search [6], new algorithms presented while this
work was in the implementation phase, perform better than Apriori on databases
with very long itemsets. Further work will compare these algorithms with Seq.

Abstract. Efficient partitioning of large data sets into homogeneous
clusters is a fundamental problem in data mining. The hierarchical clus- 
tering methods are not adaptable because of their high computational 
complexity. The K-means based algorithms give promising results for 
their efficiency. However their use is often limited to numeric data. The 
quality of clusters produced depends on the initialization of clusters and 
the order in which data elements are processed in the iteration. We 
present a method which is based on the K-means philosophy but re- 
moves the numeric data limitation. 

1 Introduction 

Given a population of individuals described by a set of attribute values,
clustering them into similar groups has many applications. The clustering problem
is partitioning a population into clusters (see [1]). The population is a set of n
elements described by m attributes. The goal is to construct a partition in which
elements of a cluster are similar and elements of different clusters are dissimilar.
It is generally not possible to define what it means to be similar. Also,
comparing one clustering result with another is very difficult, and judgement is generally
subjective and application dependent. There are several ways of defining a
measure of adequacy for a given partition, so that the defining measure can at least
serve as an objective function to be optimized over all possible partitions. Many
clustering algorithms are based on finding the partition that optimizes such an
objective function ([2]).

Two such partitioning criteria are the Intraclass inertia criterion [2], useful for
numeric attributes, and the New Condorcet criterion (NCC) [2], useful for categorical
attributes.

Clustering algorithms generally try to find a partition that optimizes the
chosen partitioning criterion. Since the number of possible partitions is large,
certain heuristics are used to find a nearly optimal solution. Clustering
algorithms are basically of two types: hierarchical clustering methods and K-means
clustering methods. We consider the K-means approach in this paper.

These algorithms search for a nearly optimal partition with a fixed number
of clusters. First an initial partition with a chosen number of clusters is built.
Then, keeping the same number of clusters, the partition is improved iteratively.
Each element is taken sequentially and reassigned to the cluster such that the
partitioning criterion is most improved by the reassignment. Different solutions
are obtained depending on which partitioning criterion is used. The most widely
used criterion is the Intraclass inertia criterion.

The K-means [3] based algorithms give promising results. However, their use is
limited to numeric data only. There are also some other problems associated with
these algorithms, and one cannot guarantee efficient clustering if these problems are
not solved. In the following sections, we describe these problems and their effect
on the convergence of the algorithm and the quality of the output. The major handicap of the
K-means based algorithms is that they are limited to numeric data. The reason
is that these algorithms optimize an objective function defined on the Euclidean
distance measure between data points and means of clusters. Minimizing the
objective function by calculating means limits their use to numeric data. Also,
the algorithm is dependent on the order in which data elements are processed.
A change in the order of input affects the convergence of the algorithm and the quality
of the output.

Section 2 gives the proposed solution. Section 3 gives the performance study 
of the algorithm on different parameters based on synthetic data sets. In section 
4 we summarize our work. 



2 Proposed solution 

In this section we present a clustering method to overcome the limitation of the 
K-means algorithm for numeric data only. We also present a new initialization 
method and input order which helps in faster convergence of the algorithm. 



2.1 Extension to categorical data 

Let $D = \{D_1, \ldots, D_n\}$ denote a set of n objects and $D_i = \{d_{i1}, \ldots, d_{im}\}$ be a
data element represented by m attribute values. Let p be the number of clusters.
The objective function that has to be minimized for the K-means procedure is

$$F(p) = \sum_{k=1}^{p} \sum_{i=1}^{n_k} d(D_i, C_k) \qquad (1)$$

where $D_i$ is data element i and $C_k$ is the centroid of cluster k. The inner term
$\sum_{i=1}^{n_k} d(D_i, C_k)$ is the total cost of assigning objects to cluster k, i.e. the total
dispersion of objects in cluster k from its centroid $C_k$. In case of numeric attributes,
this term is minimized if $C_{kj} = \frac{1}{n_k} \sum_{i=1}^{n_k} d_{ij}$ for $j = 1, \ldots, m$. Here $n_k$ is the
number of elements in cluster k.

For categorical attributes, the similarity measure is defined as

$$d(D_i, C_k) = \sum_{j=1}^{N} (d_{ij}^{n} - c_{kj}^{n})^2 + wt \times \sum_{j=1}^{C} \delta(d_{ij}^{c}, c_{kj}^{c}) \qquad (2)$$

where $\delta(a, b) = 0$ for $a = b$ and $\delta(a, b) = 1$ for $a \ne b$. $d_{ij}^{n}$ and $c_{kj}^{n}$ are values
of numeric attributes, and $d_{ij}^{c}$ and $c_{kj}^{c}$ are values of categorical attributes for
the object i and cluster k. N and C are the number of numeric attributes and
categorical attributes respectively, and wt is the weight used to keep the categorical
data from monopolizing the similarity measure and hence the clustering process.
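We can rewrite $F(p)$ as

$$F(p) = \sum_{k=1}^{p} \sum_{i=1}^{n_k} \left( \sum_{j=1}^{N} (d_{ij}^{n} - c_{kj}^{n})^2 + wt \times \sum_{j=1}^{C} \delta(d_{ij}^{c}, c_{kj}^{c}) \right) = \sum_{k=1}^{p} (W_{nk} + W_{ck}) \qquad (3)$$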

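Let $C_j$ be the set containing all unique values in the categorical attribute j
and $p(c_j \in C_j \mid k)$ the probability of value $c_j$ occurring in cluster k. Then

$$W_{ck} = wt \times \sum_{j=1}^{C} n_k \left(1 - p(c_{kj}^{c} \in C_j \mid k)\right) \qquad (4)$$

where $n_k$ is the number of objects in cluster k. $W_{ck}$ is minimized if and
only if $p(c_{kj}^{c} \in C_j \mid k) \ge p(c_j \in C_j \mid k)$ for $c_{kj}^{c} \ne c_j$, for all categorical
attributes, that is, when each categorical element of the centroid is the most
frequent value of its attribute in the cluster.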

The cost due to numeric attributes is minimized by calculating the numeric 
elements in centroid, while cost due to categorical attributes is minimized by 
selecting the categorical elements of centroid. 
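The mixed measure of equation (2) and the two centroid rules can be sketched in Python as follows. The function names, the split of each record into a numeric part and a categorical part, and the sample values are assumptions of this illustration.

```python
from collections import Counter


def mixed_distance(obj_num, obj_cat, cen_num, cen_cat, wt):
    """Squared Euclidean part plus wt times the number of categorical mismatches."""
    numeric = sum((d - c) ** 2 for d, c in zip(obj_num, cen_num))
    categorical = sum(1 for d, c in zip(obj_cat, cen_cat) if d != c)
    return numeric + wt * categorical


def centroid(members_num, members_cat):
    """Mean of each numeric attribute, most frequent value of each categorical one."""
    n = len(members_num)
    cen_num = [sum(col) / n for col in zip(*members_num)]
    cen_cat = [Counter(col).most_common(1)[0][0] for col in zip(*members_cat)]
    return cen_num, cen_cat


print(mixed_distance([1.0, 2.0], ["red"], [0.0, 0.0], ["blue"], wt=0.5))  # 5.5
print(centroid([[1.0], [3.0], [5.0]], [["red"], ["red"], ["blue"]]))
# ([3.0], ['red'])
```

When two categorical values are equally frequent in a cluster, `most_common` returns the one seen first; the text does not say how such ties should be broken.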

Estimation of weight factor. The influence of wt in clustering is significant.
When wt is zero, clustering depends only on numeric attributes. If wt is greater
than zero, an object may change cluster because it is closer to that cluster
and its categorical attribute value is the same as that of the majority of objects in that
cluster. This wt depends on many factors, such as the number of numeric and categorical
attributes, the number of possible values each categorical attribute can take and,
most importantly, the distribution of numeric attributes. We propose a method
which does not take the number of attributes into account. First we take a
sample from the original data and apply the algorithm with various wt. For
each wt, we calculate the quality of clusters with respect to numeric attributes ($Q_n$)
and categorical attributes ($Q_c$), and divide $Q_n$ and $Q_c$ by the number of numeric and
categorical attributes respectively. Then we find the wt which minimizes $\frac{Q_n}{N} + \frac{Q_c}{C}$.
When wt is too high, clustering favors categorical attributes, and this increases
$Q_n$ and decreases $Q_c$. When wt is too low, clustering favors numeric attributes,
hence $Q_n$ decreases and $Q_c$ increases. There is a particular wt for which the
quality of clustering does not favor either numeric or categorical attributes.
The sum of fractions of qualities is minimized for this wt.
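The selection of wt can be sketched as a simple search over candidate values. The callback `evaluate`, which stands for running the clustering on the sample and returning the pair (Qn, Qc), is hypothetical, as are the function name and the numbers used in the example.

```python
def choose_weight(candidate_weights, evaluate, n_numeric, n_categorical):
    """Return the candidate wt minimizing Qn / N + Qc / C.

    `evaluate(wt)` is assumed to cluster the sample with weight wt and to
    return (Qn, Qc), where lower values mean tighter clusters.
    """
    def score(wt):
        qn, qc = evaluate(wt)
        return qn / n_numeric + qc / n_categorical

    return min(candidate_weights, key=score)


# Made-up qualities for three candidate weights, for illustration only.
fake_results = {0.1: (2.0, 9.0), 0.5: (3.0, 4.5), 2.0: (8.0, 3.0)}
best = choose_weight(fake_results, fake_results.get, n_numeric=2, n_categorical=3)
print(best)  # 0.5
```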
2.2 Initialization of clusters

The proposed initialization method (for details see [4]) initializes each cluster with
one record. For each subsequent record in the data, we find the closest cluster
and assign the record to that cluster. This reduces the number of iterations while
not affecting the quality of clustering. Thus we do clustering during initialization
by making one pass on the database.

In general, records which are close and are really members of the same cluster will
go to the same cluster in this initialization process. It is only the peripheral data
elements of these clusters that move between the clusters during the iteration
process, thereby decreasing the number of iterations. A problem arises when a
cluster breaks and its data elements are distributed among many initialized
clusters. However, this kind of cluster breakage is very rare.
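The one-pass initialization can be sketched as follows for numeric records. The text does not say whether the centroid is updated as records are assigned; this sketch assumes a running mean. The function name and the sample data are also assumptions, and the detailed procedure is the one given in [4].

```python
def one_pass_init(records, seeds):
    """Assign each record to the closest cluster in a single pass.

    `seeds` holds one initial record per cluster; `records` are the remaining
    records. Returns the centroids and the cluster label of each record.
    """
    centroids = [list(s) for s in seeds]
    sizes = [1] * len(seeds)
    labels = []
    for r in records:
        k = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(r, centroids[j])))
        sizes[k] += 1
        # Running mean (assumption): move the centroid towards the new record.
        centroids[k] = [c + (a - c) / sizes[k] for a, c in zip(r, centroids[k])]
        labels.append(k)
    return centroids, labels


centroids, labels = one_pass_init([[1.0], [9.0], [2.0]], seeds=[[0.0], [10.0]])
print(labels)     # [0, 1, 0]
print(centroids)  # [[1.0], [9.5]]
```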



Choosing initial k records. One can choose these k records in any fashion,
even randomly, but care should be taken that no two data records are identical.
Another option is to divide the range of each attribute into k segments and assign
the midpoint of each segment to the attribute value of each initial record. For a categorical
attribute we find the k most frequent attribute values and assign each of them
to one record. The detailed initialization procedure, along with the distance measure
covering categorical attributes, is presented in [4].

2.3 Order of input 

There are n! possible ways of giving input. It is a known fact that the algorithm
converges in oscillatory fashion. This kind of oscillation is due to the random
distribution of data values when considered in the input order.

If we give input in uniform order, i.e. if the variation in data values is not
drastic, the convergence may be faster. But there are many attributes involved
and one cannot identify such a uniform ordering of data. For this we suggest
giving the data in monotonic order of the leading attribute. This is because the leading
attribute is the one that dominates the clustering process. In this way, the
number of elements that change clusters very often due to the misplacement
of some previous records will decrease.



2.4 New distance measurement for categorical data 

Concept hierarchy plays an important role in the KDD process as it specifies
background and domain knowledge and helps in better mining of data. We propose a
new distance measurement for attributes whose values can be put in a hierarchy.
Consider the concept hierarchy given in Fig. 1. The 0-1 distance measurement
treats the distance between red and blue and the distance between red and light red as equal.
In our new distance measurement, we take a distance called primary
distance. This primary distance is the minimum possible distance between
any two attribute values. We define the distance between two data elements with
respect to a categorical attribute as d(i, j) = primary distance if the attribute values
have the same parent in the hierarchy; otherwise it is the primary distance times the
number of edges one has to move up to reach their common parent. So for Fig.
1, if the primary distance is 1, the distance between dark red and light red is taken as
1, while the distance between light red and light green is 2. This kind of distance
measurement is not only helpful for providing knowledge about categorical
attributes but can also be used for numeric attributes. For numeric data (Fig.
2), if the primary distance is one, the distance between attribute values 20 and 36 is one,
while the distance between 20 and 70 is two.
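The hierarchy-based distance can be sketched as follows. The `parent` map is a hypothetical colour hierarchy, chosen only to be consistent with the two distances quoted in the text, since the actual content of the figure is not available here. Returning 0 for identical values, and taking the larger of the two climbs when the values sit at different depths, are assumptions of this sketch.

```python
def chain(value, parent):
    """The value followed by its ancestors up to the root."""
    out = [value]
    while out[-1] in parent:
        out.append(parent[out[-1]])
    return out


def hierarchy_distance(a, b, parent, primary=1):
    """Primary distance times the edges climbed to reach the common parent."""
    chain_a, chain_b = chain(a, parent), chain(b, parent)
    for up_a, node in enumerate(chain_a):
        if node in chain_b:
            return primary * max(up_a, chain_b.index(node))
    raise ValueError("values do not share an ancestor")


# Hypothetical hierarchy: colour -> {red, green}, each with a dark and a light shade.
parent = {
    "red": "colour", "green": "colour",
    "dark red": "red", "light red": "red",
    "dark green": "green", "light green": "green",
}
print(hierarchy_distance("dark red", "light red", parent))     # 1
print(hierarchy_distance("light red", "light green", parent))  # 2
```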




Fig. 1. Concept hierarchy for Categorical attributes

Fig. 2. Concept hierarchy for Numerical attributes



To analyze the effect of this new distance measure on clustering, one needs
a well organized database from which to extract the hierarchy. The resulting
clustering can only be analyzed by applying it to a particular domain.

3 Performance Study 

Results obtained are shown graphically in Fig. 3 to Fig. 10. For details see [4]. In 
the graphs, ’Original’ represents the values obtained when algorithm is applied on 
data without any modifications while ’Modified’ represents the values obtained 
when algorithm is applied with the proposed modifications. 

4 Conclusions 

The proposed extension of K-means clustering method allows handling of both 
numeric and categorical attributes. A new distance measurement concept for 
categorical attributes is proposed. We proposed an initialization technique which 
helps in decreasing the number of iterations. 



References 

1. F. Murtagh: Multidimensional Clustering Algorithms. Physica-Verlag, Vienna, 1985.

2. P. Michaud: Clustering techniques. Future Generation Computer Systems, (13),
1997.

3. J. A. Hartigan: Clustering Algorithms. 1975.

4. K. Sambasiva Rao: K-means Clustering for Categorical Attributes. M. Tech. Thesis,
Dec 1998, Indian Institute of Technology, New Delhi, India.




208 S.K. Gupta, K.S. Rao, and V. Bhatnagar 




Fig. 3. Number of Iterations vs. Number of Records (Biased Data)

Fig. 4. Clustering Quality vs. Number of Records (Biased Data)

Fig. 5. Number of Iterations vs. Number of Clusters (Biased Data)

Fig. 6. Clustering Quality vs. Number of Clusters (Biased Data)

Fig. 7. Number of Iterations vs. Number of Records (Random Data)

Fig. 8. Clustering Quality vs. Number of Records (Random Data)

Fig. 9. Number of Iterations vs. Number of Clusters (Random Data)




Fig. 10. Record Movement during Iterations

Abstract. We propose a family of large itemset counting algorithms
which adapt to the amount of main memory available. By using historical
or sampling data, the potential large itemsets (candidates) and the false
candidates are identified earlier. Redundant computation (and thus overall
CPU time) is reduced by counting candidates of different sizes together
and by using a dynamic trie. By counting candidates earlier and
counting more candidates in each scan, the algorithms reduce the overall
number of scans required.



1 Introduction 

Mining association rules has attracted a lot of interest in recent years [1,4]. As
introduced in [2], the association rule problem can be formally stated as follows.
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals called items. Let $D$ be a
set of transactions $T$, where $T \subseteq I$. Let $X \subset I$ and $Y \subset I$,
with $X \cap Y = \emptyset$, be called itemsets. An association rule,
$X \Rightarrow Y$, holds in $D$ with confidence $c$ if $c\%$ of the transactions
in $D$ which contain $X$ also contain $Y$. The support of $X \Rightarrow Y$ is the
percentage of transactions in $D$ which contain $X \cup Y$. Most algorithms to find
association rules actually find large itemsets. A large itemset (or a frequent
itemset) is an itemset which occurs in some minimum number (minimum support) of
transactions. Finding association rules from large itemsets is relatively easy [2].
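
The definitions of support and confidence can be made concrete with a small sketch. The transactions and the helper names `count`, `support` and `confidence` below are made up for illustration and are not taken from the paper.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(body, head, transactions):
    """Fraction of transactions containing both the body X and the head Y."""
    return count(frozenset(body) | frozenset(head), transactions) / len(transactions)


def confidence(body, head, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    both = count(frozenset(body) | frozenset(head), transactions)
    return both / count(body, transactions)


# Hypothetical database D of five transactions over items A..D.
D = [frozenset(t) for t in ("ABC", "AB", "AC", "BC", "ABCD")]

print(support("A", "B", D))     # share of transactions containing A and B
print(confidence("A", "B", D))  # share of A-transactions that also contain B
```

In this toy database the rule A => B is contained in three of the five transactions, and three of the four transactions containing A also contain B.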

The well known approaches to finding large itemsets construct sets of candidate
itemsets (or simply candidates), and then verify whether they are large
itemsets by scanning the database. Most algorithms are based on the level-wise
algorithm, Apriori [4, 8, 2]. It uses the property that an itemset is large only if
all its subsets are also large to generate the candidate k-itemsets, $C_k$, from
the large (k-1)-itemsets, $L_{k-1}$. As each size of candidates requires one
database scan, the number of database scans is the maximum size of candidates.
With Apriori the number of candidates is low; however, it also causes a lot of
redundant computation. Here redundant computation means that extra CPU time
(caused by unnecessary comparisons) is needed to count the itemsets. For example,
candidates A, AB, ABC are counted in scans 1, 2, 3 respectively. A is counted in
scan 1, but still needs to be compared (though not counted) in scans 2 and 3 to
count AB and ABC. AB is counted in scan 2, and still needs to be compared to
count ABC in scan 3.

PARTITION [9] reduces the database scans to two. It divides the database
into small partitions such that each partition can be handled in main memory.
Sampling [11] reduces the number of database scans to one in the best case and
two in the worst by using sampling techniques. AS-CPA (Anti-Skew Counting
Partition Algorithm) [7] is a family of anti-skew algorithms that filter out false
candidate itemsets at an earlier stage. DIC (Dynamic Itemset Counting) [5]
partitions the database into fixed sized intervals and counts the candidate
itemsets earlier, i.e., all 1-itemsets are counted in the first interval, and in
the second interval, both 2-itemsets and 1-itemsets are counted.
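
The level-wise candidate generation that Apriori relies on (an itemset can only be large if all its subsets are large) can be sketched as follows. This is a simplified illustration, not the implementation from [4]; the function name `apriori_gen` and the sample input are assumptions.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    A k-itemset becomes a candidate only if every one of its
    (k-1)-subsets is large.
    """
    large_prev = {frozenset(s) for s in large_prev}
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    candidates = set()
    for a, b in combinations(large_prev, 2):
        union = a | b
        if len(union) != k:
            continue  # the two itemsets differ in more than one item
        if all(frozenset(sub) in large_prev for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates


# Hypothetical large 2-itemsets.
L2 = ["AB", "AC", "BC", "BD"]
print(apriori_gen(L2))
```

Here ABC is the only candidate 3-itemset: ABD and BCD are rejected because AD and CD are not large.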

As seen in [4] and shown in the performance study below, limited main mem- 
ory can be a problem when determining large itemsets. However, this issue has 
been surprisingly ignored in most of the literature. The objective of this paper 
is to address this issue. Section 2 describes main memory issues and provides 
further motivation for this research. Section 3 briefly describes the two dynamic
candidate partition algorithms, which adapt to available main memory, and the
dynamic trie data structure which they use. Section 4 reports on preliminary
performance results, and Section 5 concludes the paper.



2 Main Memory Issues 

Let $m$ be the main memory size, $n$ the maximum size of candidates, $C_k$ the set
of candidate k-itemsets, and $mem(C_k)$ the main memory requirement for $C_k$.
There are two insufficient main memory cases:

1. $\sum_{k=1}^{n} mem(C_k) > m$ and $mem(C_k) < m$ for all k. There will be
insufficient main memory for the database scan reduction algorithms [9,11,7,5].
However, the level-wise [4,8,2] algorithms still work in this situation.

2. $\sum_{k=1}^{n} mem(C_k) > m$ and $mem(C_k) > m$ for some k. There will be
insufficient main memory for all the sequential algorithms. Only some parallel
algorithms [3, 6, 10] have considered this case.

We observe that the main memory requirement for each scan (in Apriori)
is different. In the middle scans there are more candidates and thus a larger main
memory requirement, and in the beginning or ending scans there are fewer
potential candidates and thus a smaller main memory requirement.

The Apriori algorithm can be adapted to main memory limitations by two 
simple changes [4]: 

1. If during scan k, all candidates of size k do not fit into main memory, then 
divide the candidates into groups based on the available memory size and 
count each group separately. 

2. Since the number of candidates found during the last scans of the database
may be few, later scans may be combined to reduce the number of scans of
the database. Thus candidate sets of different sizes will be counted at the same
time.

We refer to this improved version of Apriori as AprioriMem. This approach
will result in multiple database scans for a pass with insufficient memory. Here
the term pass indicates examining all candidates of one size. A scan is one
complete reading of the database. With Apriori, scans and passes are identical.
With AprioriMem, the total number of scans is:

$$\sum_{k=1}^{n} \left\lceil \frac{mem(C_k)}{m} \right\rceil \qquad (1)$$



The objective of our approach is to obtain a more even distribution of
candidates across all the scans. Thus the number of database scans will be:

$$\left\lceil \frac{\sum_{k=1}^{n} mem(C_k)}{m} \right\rceil$$

By distributing the candidates across all scans, main memory is fully utilized
in each scan (the last scan may be underfull), and the number of scans can be
reduced.
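
The difference between the two scan counts can be illustrated numerically. The sketch below assumes that memory is measured in number of candidates, that a pass needing more than m candidates is split into ceil(mem(C_k)/m) scans, and that an even distribution fills every scan except possibly the last; the per-pass sizes are invented.

```python
from math import ceil


def scans_per_pass(mem_per_pass, m):
    """Scans when each pass is split on its own (AprioriMem-style)."""
    return sum(ceil(c / m) for c in mem_per_pass)


def scans_even(mem_per_pass, m):
    """Scans when candidates are spread evenly over full scans."""
    return ceil(sum(mem_per_pass) / m)


# Hypothetical memory requirement of C_1..C_4 and a memory size of 500.
sizes = [100, 600, 900, 300]
print(scans_per_pass(sizes, 500))
print(scans_even(sizes, 500))
```

For these made-up sizes the per-pass split needs six scans, while the even distribution needs four.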

A simple approach, which we call the Static Candidate Partition Algorithm (SCPA), is
a generalization of AprioriMem and Sampling [12]. As with AprioriMem, when 
there is insufficient main memory, it partitions the candidates and counts each 
partition of candidates separately. The highlights of this algorithm include: 

1. We assume the existence of results from prior sampling as input to the 
algorithm. This gives a set of potential large itemsets, PL. 

2. All large itemsets can be found in two steps as in Sampling [11]. The first
step is to determine the true large itemsets in the approximate set PL. The
second step is to find the remaining large itemsets by counting the missing
candidates (not counted in the first step, but may be true large itemsets).

3. Itemsets of different sizes may be counted during any database scan. So 
if during scan k, memory can hold more candidates than are in Ck, then 
candidates of size k + 1 and larger will be added to the scan until memory 
is full. 

4. All missing candidates are generated in the second step at one time. This
creates an exponential effect (too many missing candidates may be generated,
as shown in Lemma 1).

Lemma 1. Let n be the number of missing large 1-itemsets, and m be the number
of true large itemsets (i.e., |TL|) found in the first step of SCPA. There are
at least $2^n \times (m + 1) - (m + n + 1)$ missing candidate itemsets generated.

Proof: see [12]. 
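
To get a feeling for the size of the bound in Lemma 1, the sketch below evaluates it and compares it with a direct enumeration. The enumeration is only one plausible reading of where the bound comes from (an assumption here; the actual proof is in [12]): every non-empty set of missing items is combined with the empty set or with one true large itemset, and the missing 1-itemsets themselves are left out. The function names and sample itemsets are illustrative.

```python
from itertools import combinations


def lemma1_bound(n, m):
    """Lower bound of Lemma 1: 2^n * (m + 1) - (m + n + 1)."""
    return 2 ** n * (m + 1) - (m + n + 1)


def extend_with_missing(missing_items, true_large):
    """Itemsets built from missing items plus at most one true large itemset."""
    bases = [frozenset()] + [frozenset(t) for t in true_large]
    result = set()
    for r in range(1, len(missing_items) + 1):
        for combo in combinations(sorted(missing_items), r):
            for base in bases:
                candidate = base | frozenset(combo)
                if len(candidate) > 1:  # missing 1-itemsets are already counted
                    result.add(candidate)
    return result


true_large = ["A", "B", "C", "AB", "AC", "BC", "ABC"]
print(lemma1_bound(1, len(true_large)))
print(len(extend_with_missing("D", true_large)))
```

With one missing item and seven true large itemsets both numbers are 7; each additional missing item roughly doubles the count.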



3 Our Approach: Dynamic Candidate Partition 
Algorithms 

We propose two algorithms based on the simple idea of the Static Partition
Algorithm. The algorithms which we propose attack both of the insufficient
memory problems discussed in the last section. As with SCPA, we assume that an
approximate set of large itemsets, PL, is given. PL may have been obtained from
a sampling step or may be the actual set of large itemsets from an earlier database
step. A more thorough discussion of these algorithms and their performance can
be found elsewhere [12].

Unlike Sampling and SCPA, there is only one step, i.e., the missing candidates 
are generated dynamically and iteratively as needed. Compared to the static 
partition approach, the dynamic partition algorithms have the following features: 

1. There is no separate step for the missing candidates. Instead the missing can- 
didates are generated dynamically and counted in different database scans. 

2. The exponential effect is avoided by the iterative generation of missing can- 
didates. 

3. They adapt to the available main memory, thus making maximum usage of 
main memory and reducing the number of database scans required. 

4. A dynamic trie data structure is used to count the candidates so that there 
is less redundant computation between different scans. 

5. Some false candidates can be identified and pruned away by employing the 
large itemsets from the previous scans. 

We use a trie data structure, which is similar to that in [5] with some changes. 
We assume that all itemsets are listed in lexicographic order so that each item 
has a unique position in any path. For convenience, we view all the candidates
generated from the PL (i.e. $PL \cup BD^-(PL)$, where $BD^-(PL)$ is the negative
border function in [11], which includes the itemsets all of whose proper subsets
are in PL) as existing in a virtual trie. In actuality this trie does not
exist. Having this trie in mind, however, will help to understand how our pro- 
posed algorithms function. In effect, the algorithms dynamically choose a subset 
of the nodes in the trie to count (materialize) during each database scan. The 
algorithms differ in how these nodes are chosen. Both, however, ensure that 
the set of candidates chosen during each scan will fit into the available main 
memory. At any time, the (materialized) trie contains the maximum number of 
nodes needed for the itemsets being counted. Initially the trie contains the set
of candidates being considered in the first database scan. At the end of each scan
the trie is traversed, nodes are removed, and new nodes are added. When the 
algorithms terminate, only large itemsets remain. The difference between the 
trie data structure and the hash tree which has been previously used [4], lies in 
the fact that all nodes in the trie represent an itemset, while in the hash tree 
only leaf nodes contain the itemsets. 

Example 1: Figure 1a) shows a sample trie for items {A, B, C, D, E, F, G, H,
I, J}. Suppose that through historical information it is determined that the
approximate set of large itemsets is PL = {A, B, C, E, AB, AC, AE, BC, BE, ABC,
ABE}. The candidates for the first pass of SCPA, then, are $PL \cup BD^-(PL)$,
where $BD^-(PL)$ = {D, F, G, H, I, J, CE}. Figure 1a) shows the trie which is used
during the first step of SCPA. After the first step, suppose we find the large
itemsets {A, B, C, D, AB, AC, BC, ABC}, among which A, B, C, AB, AC, BC, ABC
are true large itemsets, and D is a missing large itemset. The candidates for
the second step are then {AD, BD, CD, ABD, ACD, BCD, ABCD}, which are
generated from the missing large itemset D. The trie for the second step is shown
in Figure 1b). After the second step, we find the remaining large itemset: AD.
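
The negative border used in Example 1 can be recomputed with a short sketch. It assumes that PL is downward closed (every subset of an itemset in PL is also in PL), so checking the subsets that are one item smaller is enough; the function name `negative_border` is illustrative and not from [11].

```python
from itertools import combinations


def negative_border(pl, items):
    """Itemsets outside PL all of whose proper subsets are in PL."""
    pl = {frozenset(s) for s in pl}
    # 1-itemsets outside PL: their only proper subset is the empty set.
    border = {frozenset([i]) for i in items if frozenset([i]) not in pl}
    by_size = {}
    for s in pl:
        by_size.setdefault(len(s), set()).add(s)
    for k, level in by_size.items():
        for a, b in combinations(level, 2):
            candidate = a | b
            if len(candidate) != k + 1 or candidate in pl:
                continue
            if all(frozenset(sub) in pl for sub in combinations(candidate, k)):
                border.add(candidate)
    return border


PL = ["A", "B", "C", "E", "AB", "AC", "AE", "BC", "BE", "ABC", "ABE"]
border = negative_border(PL, "ABCDEFGHIJ")
print(sorted("".join(sorted(s)) for s in border))
```

For the PL of Example 1 this yields the six items outside PL together with CE; ACE and BCE are excluded because their subset CE is not in PL.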



Fig. 1. The Trie Data Structure Used in SCPA: a) Trie for First Step, b) Trie for Second Step



To facilitate the dynamic algorithms, status information is kept for each node: 

- ALL_COUNTED: The node and all its descendent nodes have been counted.

- ALL_UNCOUNTED: Neither the node nor any of its descendents have
been counted.

- DESCENDENT_UNCOUNTED: The node itself has been counted but at
least one of its descendent nodes hasn't been counted.

While processing a transaction, a depth-first traversal over part of the trie is
used to increment the counts of the itemsets in the trie.
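
A minimal sketch of such a trie and of the depth-first counting step is given below. It is a simplification under stated assumptions: every node stands for the itemset spelled by its path, items are kept in lexicographic order, and the per-node status flags described above are omitted. The names `TrieNode`, `insert` and `count_transaction` are illustrative.

```python
class TrieNode:
    """A node represents the itemset formed by the items on its path."""

    def __init__(self):
        self.count = 0
        self.children = {}  # item -> TrieNode


def insert(root, itemset):
    """Add an itemset (and its prefixes) to the trie; return its node."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())
    return node


def count_transaction(node, transaction, start=0):
    """Depth-first walk: increment every itemset contained in the transaction.

    `transaction` must be a sorted list of items.
    """
    for pos in range(start, len(transaction)):
        child = node.children.get(transaction[pos])
        if child is not None:
            child.count += 1
            # Only later items can extend the itemset, so continue at pos + 1.
            count_transaction(child, transaction, pos + 1)


root = TrieNode()
nodes = {name: insert(root, name) for name in ["A", "AB", "ABC", "B"]}
for t in (["A", "B", "C"], ["A", "C"], ["B"]):
    count_transaction(root, t)
print({name: n.count for name, n in nodes.items()})
```

Because AB and ABC share the path through A, the item A is compared only once per transaction for all three itemsets.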

The Breadth-First Partition Algorithm (BFPA) selects the candidates for 
each scan through a breadth-first traversal over the virtual trie until main mem- 
ory is full, i.e., first candidate 1-itemsets, then candidate 2-itemsets and so on. 
When generating candidates, we assume that all immediate children of a node 
are generated together and counted in one scan. This reduces redundant com- 
putation as well as simplifying candidate generation. 
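
The selection rule of BFPA can be sketched as a breadth-first walk that adds the children of a node as one group. This is a rough illustration, not the algorithm as implemented by the authors: the virtual trie is passed in as a plain dictionary, memory is measured in nodes, and the walk simply stops at the first group of siblings that no longer fits, which is an assumption. The name `select_breadth_first` and the sample trie are hypothetical.

```python
from collections import deque


def select_breadth_first(children, capacity, root=""):
    """Pick the trie nodes to count in one scan, level by level.

    children: maps a node to the ordered list of its immediate children
    capacity: maximum number of nodes that fit in main memory
    """
    selected = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        group = children.get(node, [])
        if not group:
            continue
        if len(selected) + len(group) > capacity:
            break  # siblings are generated together or not at all
        selected.extend(group)
        queue.extend(group)
    return selected


# Hypothetical virtual trie: the root is the empty string.
virtual_trie = {
    "": ["A", "B", "C"],
    "A": ["AB", "AC", "AE"],
    "B": ["BC"],
    "AB": ["ABC"],
}
print(select_breadth_first(virtual_trie, 5))
print(select_breadth_first(virtual_trie, 7))
```

With room for five nodes only A, B and C are selected, even though two slots remain free, because the three children of A do not fit together.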

Example 2: Figure 2 shows the tries which are used to implement BFPA for
Example 1, assuming there is enough memory for 15 nodes in the trie. The total
number of database scans is two. In the first scan, all candidate 1-itemsets and
some of the candidate 2-itemsets are counted. In Figure 2, we don't generate AE
in the first scan, although the memory can still hold one more node, as AB, AC
cannot be held together with AE. At the end of this first scan, small itemsets are
removed and new nodes are added using the negative border. The materialized
trie for the second scan is shown in Figure 2b).

The memory requirement for the dynamic algorithms can be reduced by
writing large itemsets whose counting status is ALL_COUNTED to hard disk.

Fig. 2. Breadth-First Partition Example: a) Trie for First Scan, b) Trie for Second Scan



In Figure 2, D has been written to the hard disk, as D has been counted and used
for generating candidates, so no line is shown to that node in the figure.

One disadvantage of the breadth-first partition is that there is still redundant 
computation between different database scans, as the counting status value of a 
counted node may be DESCENDENT_UNCOUNTED. However, its redundant 
computation will be less than that of Apriori, as candidates of different sizes can 
be counted in one scan. 

The Breadth-First Partition Algorithm is shown in Algorithm 1.

Algorithm 1
Input:

PL: PL_1 ∪ PL_2 ∪ ... ∪ PL_n // probable large itemsets
I, D, s // items, database of transactions, minimum support
m // memory size (maximum number of candidates in memory)

Output:

L: L_1 ∪ L_2 ∪ ... ∪ L_n // large itemsets

BFPA Algorithm:

1) TRIE = I // initially 1-itemsets
2) k = 1
3) while mem(TRIE ∪ BD^-(PL_k)) < m do begin
4)     TRIE = TRIE ∪ BD^-(PL_k); // expand the trie by one level
5)     k++; end
6) repeat // the negative border of PL_k will span different scans
7)     Find maximal SS ⊆ PL_k, where mem(TRIE ∪ BD^-(SS)) < m.
8)     TRIE = TRIE ∪ SS;
9)     PL_k = PL_k - SS;
10)    TL = {the true large itemsets in TRIE found in D}
11)    ML = {the missing large itemsets in TRIE found in D}
12)    TRIE = TL ∪ ML // other small itemsets are removed from the trie
       // BD^-* only generates the negative border that has NOT been
       // generated in previous scans.
13)    TRIE = TRIE ∪ BD^-*(TRIE)
14)    while mem(BD^-*(ML ∪ TL ∪ PL_k)) < m do begin
15)        TRIE = TRIE ∪ (BD^-*(ML ∪ TL ∪ PL_k));
16)        k++; end
17) until TRIE does not change
18) return L = ML ∪ TL;



By processing the trie in a depth-first, rather than breadth-first, manner, a
Depth-First Partition Algorithm (DFPA) is created. DFPA selects the candidates
for each scan through a depth-first traversal over the virtual trie until main
memory is full. The details of the DFPA algorithm are found in [12].

4 Performance Results 

In this section we discuss the performance results comparing five different al- 
gorithms: DFPA, BFPA, SCPA, Apriori, and AprioriMem. The first four algo- 
rithms were actually implemented in C. The performance of AprioriMem was 
estimated based on the results of Apriori without any memory constraints. 

We report on experiments performed using one synthetic data set from [4],
D100K.T10.I4: 100K transactions, an average transaction size of 10 and an
average size of large itemsets of 4. The number of items and large itemsets are
set to 1000 and 2000 respectively as in [4]. Two minimum supports, 0.5% and
0.1%, were used, and for each minimum support we ran the algorithms with two
memory sizes (number of nodes in the trie), 500k and 50k.

Before running our algorithms for each minimum support, an approximate 
set of large itemsets is obtained by sampling with the same minimum support. 
In order to show how our algorithms perform when the approximate set is not
accurate (very possible if obtained from historical data), we use a small sample
size (25K) and the same minimum support (instead of a reduced one, as was proposed
for the Sampling algorithm) to find the approximate set.

We examined three metrics: the number of scans, candidates and compar- 
isons. We report on the results of experiments for each metric in the subsequent 
three subsections. 

4.1 Number of Scans 

Table 1 shows the number of database scans of the algorithms for all four cases.
Apriori ran out of memory when the minimum support was 0.1%. We reran it
on another machine with a larger memory to get the results. The numbers in the
column Apriori are the pass numbers for Apriori with no memory constraints (so
each pass can be put in the memory) as for the dynamic algorithms. The number
of scans for AprioriMem in Table 1 was estimated by formula (1) based on the
number of candidates for each pass of Apriori and the memory constraint.

Our dynamic algorithms (Depth-first and Breadth-first) have the fewest scans
in all cases with memory constraints, because they employ the historical or
sampling information to identify candidates earlier and fully utilize memory in
each scan. It is interesting to note that the static algorithm suffered from the
exponential effect (too many candidates in the second step, as shown in Table 2,
and thus too many scans), as the approximate set was not accurate enough due to
the high minimum support (0.5%) and small sample size (25K). By generating the
missing candidates dynamically and iteratively, both dynamic algorithms
avoided the exponential effect.

min_sup  max_mem  Depth-first  Breadth-first  Static  AprioriMem   Apriori
                                                      (Estimated)  (No Memory Constraints)
0.5%     500k     2            2              7       4            4
         50k      8            8              28      11
0.1%     500k     2            2              3       11           11
         50k      11           11             12      20

Table 1. Number of scans for D100K.T10.I4



4.2 Number of Candidates 



Table 2 shows the total number of candidates examined by each algorithm. Not
surprisingly, Apriori has the fewest candidates, as its candidate generation
is level-wise. As AprioriMem has the same candidate numbers as Apriori,
we only show its numbers in the column Apriori. Note that our dynamic
algorithms employ the historical or sampling information to identify candidates
earlier, but due to the inaccuracy of the approximate set some candidates may
turn out to be false (not candidates of Apriori); thus we may have more
candidates generated.



min_sup  max_mem  Depth-first  Breadth-first  Static    Apriori
                                                        (No Memory Constraints)
0.5%     500k     433816       433816         1561579*  370414
         50k      431117       428251
0.1%     500k     530820       530531         532671    528586
         50k      529017       528992

Table 2. Number of candidates for D100K.T10.I4



Compared to the static algorithm, both dynamic algorithms have fewer
candidates. When the minimum support is high (the accuracy of the sampling is
low), both Depth-first and Breadth-first do not suffer the exponential effect, as
they generate the missing candidates dynamically and iteratively. As the
minimum support decreases, and the accuracy of the sampling thus increases,
both dynamic algorithms have candidate numbers closer to that of Apriori.

4.3 Redundant Computation Reduction

In order to measure the reduction of redundant computation by our dynamic
algorithms, which directly impacts the CPU time, we use the number of
comparisons for the algorithms. The comparisons reported only include those
directly involved in counting. For our dynamic trie data structure, whenever the
item associated with a node is compared (even when not incrementing the counter),
the number of comparisons is increased by one. With the hash tree data structure
(in Apriori), for each k-itemset in the leaf node, k comparisons are required to
find out whether it is in the transaction.



min_sup  max_mem  Depth-first  Breadth-first  Apriori
                                              (No Memory Constraints)
0.5%     500k     7.36e6       7.36e6         1.63e7
         50k      1.50e7       1.55e7
0.1%     500k     2.89e7       3.97e7         8.87e7
         50k      3.95e7       4.62e7

Table 3. Number of Comparisons for D100K.T10.I4



As shown in Table 3, both the dynamic algorithms have reduced the number
of comparisons from that in Apriori. The major redundant computation
reduction comes from the dynamic trie data structure and counting candidates
of different sizes together in one scan. A depth-first traversal over part of the trie
for counting reduces the number of comparisons required, as the common
prefixes of the itemsets are counted only once for a transaction. By contrast, in the
hash tree every candidate k-itemset requires at least k comparisons regardless
of their common prefixes. Also the itemsets which are counted in the previous
scans are compared in later passes, as Apriori is level-wise.

As expected, Depth-first reduced more redundant computation than Breadth-first.
This results from its depth-first nature. Depth-first counts a node and all
its descendent nodes (which have been generated as potential candidates) in one 
scan, thus it does not need to compare these nodes again in later scans. Breadth- 
first needs to keep the nodes counted in the previous scans in the trie in order 
to count their descendent nodes in the later scans. 

5 Conclusion and Future Work 

We have proposed two dynamic algorithms for mining large itemsets. They are
adaptive to the amount of memory available. By dynamic pruning and dynamic
iterative generation of the missing candidates, they result in fewer database scans
without greatly increasing the number of candidates. Both algorithms employ
historical or sampling data to identify candidates earlier.

In the future we will perform experiments using more realistic datasets and
compare to more algorithms. The Depth-first algorithm and the Breadth-first
algorithm have complementary advantages and disadvantages. We are developing
a hybrid approach of the Depth-first and the Breadth-first combining the best
features of each.

Abstract. The user interested in mining a data set by means of the
extraction of association rules has to formulate mining queries or meta- 
patterns for association rule mining, which specify the features of the 
particular data mining problem. 

In this paper, we propose an exploration technique for the discovery of 
association rule meta-patterns able to extract quality rule sets, i.e. as- 
sociation rule sets which are meaningful and useful for the user. The 
proposed method is based on simple heuristic analysis techniques, suit- 
able for an efficient preliminary analysis performed before applying the 
computationally expensive techniques for mining association rules. 



1 Introduction 

The research area named Data Mining, also known as Knowledge Discovery in
Databases, has grown impressively in recent years. A large number of papers
have appeared in the literature, with interesting and promising results,
concerning a large variety of topics. In the early years, attention was primarily
focused on the development of efficient techniques able to analyze and discover
patterns from large and very large data sets. Several techniques were considered,
from classification [7] to the discovery of similarities in time series [1], but the
most investigated technique is the extraction of association rules [2, 3].

Subsequently, researchers turned their attention to other topics,
such as the development of specification languages [5, 8], which allow the user
to specify a generic problem based on the extraction of association rules, and
the integration of data mining techniques with relational databases [6], in order
to develop systems [9] that exploit the presence of a relational database as a
repository for the analyzed data.

The developed techniques are powerful, but practical experience shows that
their use is neither easy nor immediate. Thus, methodological support should be
provided. An attempt to answer this problem is the Knowledge Discovery
Process (KDP) proposed in [4]. The KDP starts with the comprehension
of the context, followed by the selection of a significant data set. Then, after
data are preprocessed and simplified, the KDP requires the user to identify a
data mining task, choose the data mining method or technique and apply it.
Finally, results are evaluated, in order to formalize the discovered knowledge.

The Data Mining phase of the KDP requires the choice and the application
of a data mining technique; once the technique is chosen, the user is asked to
drive the process, typically by selecting the features of the data that must be
investigated. If the user chooses the extraction of association rules, such
features are, e.g., the attribute whose values are associated by rules. The
features relevant for the extraction of association rules constitute a meta-pattern.

However, it may be difficult to specify significant meta-patterns, when the 
data set to analyze is unknown, or has a large number of attributes. 

In this paper, we propose a method to explore the data and identify meta- 
patterns for the extraction of association rules that, applied to the data set to 
analyze, cause the extraction of quality association rule sets. Such an exploration 
method is meant to be preliminary, in the sense that it must precede the actual 
extraction of association rules; hence, it must require light preliminary analysis 
of the data. Since exact techniques would require complex and expensive analysis 
of the data, we base the method on simple heuristic techniques. 

The paper is organized as follows. Section 2 introduces the semantic framework
and formalizes the concept of association rule meta-pattern. Section 3 provides
an introductory overview of our method. Sections 4 and 5 discuss in detail
the analysis phases required by the method. Section 6 draws the conclusions.

2 Meta-Patterns for Association Rules 

The problem of mining association rules has been defined and mostly studied in 
the case of transaction data sets, where rules associate items sold in commercial 
transactions. However, the technique is general, and can be applied to any data 
set, whose structure might be complex. It is necessary to properly define what
kind of information we would like to mine from the analyzed data set. For this
purpose, we introduce the concept of meta-pattern for association rules.

A meta-pattern for association rules is a tuple p : (T, g, m, s, c), where T, g,
m, s and c are the parameters of the pattern. Their meaning is the following.

T: this parameter denotes the source data set. We assume that the data set
is a relational table.^1 The notation Schema(T) denotes the set of attributes
(columns) of the table.

m: this parameter denotes the rule attribute (also named the mined attribute), i.e.
the attribute on which rules are mined; if we denote the set of m's values as $V_m$,
an association rule associates values of $V_m$. More precisely, for a rule
$r : B \Rightarrow H$, it is $B \subseteq V_m$, $H \subseteq V_m$, $|H| = 1$,^2
$B \cap H = \emptyset$. The size of a rule $r : B \Rightarrow H$ is the number of
values in the body and in the head, i.e. $size(r) = |B| + |H|$.

g: this parameter denotes the grouping attribute, w.r.t. which rules express
regularities. The source data set is logically partitioned in groups having the same
value for the grouping attribute.^3 Rules associate rule attribute's values that
appear together in the same groups. In the following, the total number of groups
is denoted as G.

^1 This assumption is motivated by the fact that data sets to analyze usually come
from relational operational databases or ROLAP data warehouse servers.

^2 This constraint is motivated by the need for clarity in examples. In effect, the
technique proposed in the paper is not affected by the cardinality of heads.

^3 For the sake of clarity, parameter g (resp. m) of meta-patterns defines one single
attribute. We can generalize assuming that g (resp. m) defines a list of attributes.

Fig. 1. (a) Table Transactions, (b) A rule set extracted from the table.

s: this parameter specifies the minimum support. The support $s_r$ of a rule r is
defined as $s_r = G_r / G$, where $G_r$ is the number of groups that contain r. The
support denotes the frequency with which the rule appears in groups. In order
to select only relevant rules, a minimum threshold s for support is defined, so
that only rules with $s_r \geq s$ are mined.

c: this parameter specifies the minimum confidence. The confidence of a rule
r is defined as $c_r = G_r / G_s$, where $G_s$ is the number of groups that contain at
least the body B of the rule. The confidence denotes the conditional probability
of finding the entire rule in a group in which the rule's body is found. In order to
select only rules with significant confidence, a minimum threshold c for confidence
is defined, so that only rules with $c_r \geq c$ are mined.

The application of a meta-pattern p to an instance of the source table T 
produces a set of association rules, denoted as R. 
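
How the grouping attribute g and the rule attribute m determine the support and confidence of a rule can be sketched as follows. The rows below are invented (the table of Figure 1 is not reproduced here), and the helper names `group_values` and `rule_support_confidence` are illustrative assumptions.

```python
from collections import defaultdict


def group_values(rows, g, m):
    """Partition the table by attribute g and collect the values of m."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[g]].add(row[m])
    return groups


def rule_support_confidence(rows, g, m, body, head):
    """Support and confidence of the rule body => head over the groups."""
    groups = group_values(rows, g, m)
    body, head = set(body), set(head)
    total = len(groups)  # G
    with_body = sum(1 for values in groups.values() if body <= values)
    with_rule = sum(1 for values in groups.values() if body | head <= values)
    support = with_rule / total
    confidence = with_rule / with_body if with_body else 0.0
    return support, confidence


# Hypothetical purchases: four customers, three products.
rows = [
    {"customer": "c1", "product": "A"}, {"customer": "c1", "product": "B"},
    {"customer": "c2", "product": "A"}, {"customer": "c2", "product": "B"},
    {"customer": "c2", "product": "C"}, {"customer": "c3", "product": "A"},
    {"customer": "c4", "product": "B"}, {"customer": "c4", "product": "C"},
]
print(rule_support_confidence(rows, "customer", "product", {"A"}, {"B"}))
```

In this toy table the rule A => B holds for two of the four customers, and for two of the three customers who bought A.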

Example 1: Consider table Transactions depicted in Figure 1.a. It reports
data about commercial transactions. Each transaction has an identifier
(attribute id), the customer that performed the transaction (attribute customer),
the product object of the transaction (attribute product), the day on which the
transaction is performed (attribute day), specified by means of the progressive
number from the beginning of the year, and the month of the transaction (attribute
month), where 1 stands for January and 2 for February.

Consider p : (Transactions, customer, product, 0.15, 0.5). It extracts rules
from the table Transactions, such that rules associate products (mined
attribute product) frequently purchased together by a single customer (grouping
attribute customer); rules are considered relevant if they hold for at least 18%
of customers (minimum support s = 0.18), and their conditional probability is
at least 50% (minimum confidence c = 0.5). The meta-pattern produces the
rule set shown in Figure 1.b. □

We now introduce some properties concerning association rule sets.^4

Proposition 1: Consider an association rule meta-pattern p : (T, g, m, s, c)
applied to a data set T. Let $g_i$ be a group in the data set T partitioned by the
grouping attribute g, and R be the association rule set extracted by p. The upper
bound for the size of rules in R is the cardinality of the largest group in T, i.e.
$\max_{r \in R}(size(r)) \leq \max_i(|g_i|)$. □

Theorem 1: Consider a meta-pattern p and a table T. If there exists in T a 
functional dependency g ^ m between the grouping attribute g and the rule 
attribute to, p applied to an instance of T produces an empty rule set R. □ 
Theorem 2: Consider a meta-pattern p and a table T. If there exists in T 
a functional dependency m ^ g between p.g and p.m, the highest minimum 
support to obtain a non-empty rule set i? is s = 1/G. □ 

3 Our Approach 

Consider now a user who has to analyze a data set. The user has to formulate
a meta-pattern, but it is not easy to formulate one that produces
a satisfactory rule set, i.e. a quality rule set that can be effectively exploited,
e.g. for decision making. In general, it is not possible to give an exact definition
of a quality rule set, because it depends on the application and on the user’s
expectations; however, several quality factors for rule sets can be identified.

- Suitable number of rules. The first factor perceived by the user is the 
number of mined rules. If they are too few, the rule set may be useless because 
it does not provide information. If they are too many, the rule set may be useless 
too, because it provides too much information. 

- Easy comprehension of rules. An easily comprehensible rule is appreciated
by the user. Typically, the size of a rule heavily affects its comprehension:
small sized rules are preferable, because they are easily read and interpreted; large
sized rules are difficult to read and to interpret.

- Adequate synthesis level. Association rules provide a synthetic view of the
data set. An adequate synthesis level depends on the application, but, to
be significant, the rule set R must synthesize a significant portion of the data.

- Significance. The significance of a rule set can be understood as the degree of
useful information provided to the user. Rule sets with a significant number of
unexpected rules are preferred by the user to rule sets which contain a large
number of expected and obvious (depending on the application context) rules.

Footnote: For brevity, we do not report the proofs of the propositions and theorems. They can be found in [10].
Without any exploration method, the user can obtain a quality rule set only
by performing several extraction trials. Unfortunately, when the data set is
unknown, this approach can result in a large number of failures and a waste of
time, because the extraction of association rules from a large data set is
expensive in terms of execution time.

Consequently, the user needs an exploration method to define suitable association
rule meta-patterns, possibly based on a simple and fast preliminary analysis
of the data.

4 Choosing the Attributes 

The first issue to deal with while defining an association rule pattern is the 
definition of grouping and rule attributes. In this section, we discuss the phases 
of the proposed exploration method which concern the choice of attributes. 

4.1 Grouping Attribute 

The first parameter of a meta-pattern to identify is the grouping attribute g. To 
begin our discussion, we need the following definition. 

Definition 1: Let T be a table, whose schema $(a_1, \ldots, a_n)$ is denoted as
Schema(T). Let a be an attribute of T ($a \in Schema(T)$), and $V_a$ be the set
of a’s values appearing in T. If the table is grouped by attribute a, the average
number of tuples in a group, denoted as $n_a$, is defined as $n_a = |T| / |V_a|$. □

High values of $n_a$ denote that groups contain a large number of tuples. Consequently,
it is possible to extract a large number of rules, of relevant size
(Prop. 1). Low values of $n_a$ denote that groups contain a small number of tuples.
Consequently, it is possible to extract only rules whose size is small (Prop. 1).

Attribute Selection. Having evaluated $n_a$, the user selects the attributes considered most
interesting, based on the quality criteria introduced in Section 3. Attributes are
ranked by increasing values of $n_a$: those with the lowest values of $n_a$ are preferable,
because the size of rules is limited by the small size of groups (Proposition 1).
Attributes with a value of $n_a$ equal to 1 must be discarded, as illustrated by the
following theorem. In our experience, we found that values of $n_a$ greater than
1 but less than 5 or 6 give the best results in terms of comprehension of the
resulting rule set.

Theorem 3: Consider a table T and an attribute a of T ($a \in Schema(T)$).
If $n_a = 1$, every association rule meta-pattern p with parameter g = a produces an
empty set of rules. □

Example 2: Consider the table Transactions of Figure 1.a. The ranked attributes,
with the cardinality of their sets of values and the corresponding
value of $n_a$, are reported in the following table.

Attribute    |V_a|    n_a      Outcome
id           25       1        discarded
customer     11       2.27     selected
day          9        2.78     selected
product      6        4.17     selected
month        2        12.5     not selected    □
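The ranking in Example 2 can be reproduced from the reported cardinalities alone. The sketch below is illustrative: the helper name `avg_group_size` is hypothetical, and |T| = 25 is inferred from the id attribute (25 distinct values with $n_a$ = 1). Only the hard rule of Theorem 3 is encoded; deciding that month is "not selected" remains a user judgment.

```python
def avg_group_size(num_tuples, num_values):
    """n_a = |T| / |V_a| (Definition 1)."""
    return num_tuples / num_values

T_SIZE = 25  # |T|, inferred from the id attribute of Example 2
cardinalities = {"id": 25, "customer": 11, "day": 9, "product": 6, "month": 2}

# Rank attributes by increasing n_a.
ranking = sorted(
    ((attr, avg_group_size(T_SIZE, card)) for attr, card in cardinalities.items()),
    key=lambda pair: pair[1],
)
# Attributes with n_a == 1 cannot yield rules (Theorem 3) and are dropped.
candidates = [(attr, round(n, 2)) for attr, n in ranking if n > 1]
```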
4.2 Rule Attribute

The second parameter to specify in an association rule meta-pattern is the rule
attribute m. To guide our discussion, we need the following definition.

Definition 2: Let T be a table, whose schema $(a_1, \ldots, a_n)$ is denoted as
Schema(T). Let a and b be two attributes of the table ($a, b \in Schema(T)$),
and $V_a$ and $V_b$ be the sets of a’s and b’s values, respectively, appearing in T. The
density of b’s values w.r.t. a’s values, $d_{a,b}$, is defined as $d_{a,b} = |V_b| / |V_a|$. □

In other words, given two attributes a and b, $d_{a,b}$ gives an idea of the distribution
of values of the (possible) rule attribute b w.r.t. the (possible) grouping
attribute a. Two situations illustrate the importance of the density of mined
values w.r.t. groups. High values of $d_{a,b}$ denote that items in $V_b$ may appear in
a large number of groups. Consequently, the meta-pattern would extract a large
number of rules, possibly of relevant size, and with high support. This is in contrast
with the quality criteria discussed in Section 3. Low values of $d_{a,b}$ denote
that the rule attribute has too few values w.r.t. the grouping attribute. Consequently,
most groups may contain the entire set $V_b$, causing the extraction of
all possible permutations of values in $V_b$ with very high support.

We now introduce two indexes that can provide useful indications about the 
choice of the rule attribute. 

Definition 3: Let T be a table, whose schema $(a_1, \ldots, a_n)$ is denoted as
Schema(T). Let a and b be two attributes of the table ($a, b \in Schema(T)$).
Suppose that T is grouped by attribute a; for each group $g_i$, we denote with
$z_i$ the number of distinct values of b that appear in the group $g_i$. The average
number of distinct values of attribute b in a group, denoted as $a_{a,b}$, and the maximum
number of distinct values of attribute b in a group, denoted as $u_{a,b}$, are
defined as $a_{a,b} = (\sum_{g_i} z_i) / |V_a|$, $u_{a,b} = \max_{g_i}(z_i)$. □

Definition 4: Let T be a table, whose schema $(a_1, \ldots, a_n)$ is denoted as
Schema(T). Let a and b be two attributes of the table ($a, b \in Schema(T)$),
and $V_b$ be the set of b’s values appearing in T. Suppose that T is grouped by
attribute a; $a_{a,b}$ is the average number of distinct values of b in a group. The
average fraction of the set of b’s values in each group, denoted as $f_{a,b}$, is defined
as $f_{a,b} = a_{a,b} / |V_b|$. □

In other words, by means of $a_{a,b}$ it is possible to see if the pair of attributes is
suitable for small sized rules, which are preferable based on the quality criteria.
Furthermore, by means of $f_{a,b}$ it is possible to estimate how many values of the
rule attribute appear together in the same group: if they are too many ($f_{a,b}$ close
to 1), it may happen that almost all the permutations of b’s values have a very
high support and would be extracted by the meta-pattern; this is in contrast
with the quality criteria.

Finally, some useful properties can be proved by means of the indexes.

Theorem 4: Consider a table T and a pair of attributes (a, b) of T ($a, b \in
Schema(T)$). If the functional dependency a → b holds, then $a_{a,b} = 1$. □
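The four indexes of Definitions 2 to 4 can all be obtained in one pass over the table. The following is a minimal sketch under stated assumptions: the function name `pair_indexes` and the sample rows are hypothetical, and the rows are chosen so that the dependency day → month holds, which illustrates Theorem 4.

```python
from collections import defaultdict

def pair_indexes(rows, a, b):
    """Return (d, avg, mx, f) for grouping attribute a and rule attribute b."""
    values_a = {row[a] for row in rows}
    values_b = {row[b] for row in rows}
    per_group = defaultdict(set)
    for row in rows:
        per_group[row[a]].add(row[b])
    z = [len(vals) for vals in per_group.values()]  # z_i for each group g_i
    d = len(values_b) / len(values_a)               # density d_{a,b}
    avg = sum(z) / len(values_a)                    # a_{a,b}
    mx = max(z)                                     # u_{a,b}
    f = avg / len(values_b)                         # f_{a,b}
    return d, avg, mx, f

# Hypothetical rows: each day belongs to exactly one month (day -> month).
rows = [
    {"day": 3, "month": 1, "product": "A"},
    {"day": 3, "month": 1, "product": "B"},
    {"day": 20, "month": 1, "product": "A"},
    {"day": 40, "month": 2, "product": "C"},
]
d, avg, mx, f = pair_indexes(rows, "day", "month")
```

Because every day group contains a single month, the average and the maximum number of distinct values per group are both 1, as Theorem 4 predicts.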
Pair (a,b)             d_{a,b}
(product, id)          4.16     discarded
(day, id)              2.78     discarded
(customer, id)         2.27     discarded
(product, customer)    1.83     discarded
(product, day)         1.5      discarded
(day, customer)        1.22     discarded

Pair (a,b)             d_{a,b}  a_{a,b}  f_{a,b}
(customer, day)        0.82     1.45     0.15     selected
(day, product)         0.67     2.2      0.37     selected
(customer, product)    0.55     2.27     0.36     selected
(product, month)       0.33     1.83     0.92     not selected
(day, month)           0.22     1        0.50     not selected
(customer, month)      0.18     1.09     0.55     not selected

Fig. 2. The ranked list of attribute pairs.



Theorem 5: Consider an association rule meta-pattern p applied to a table T,
where g and m are the grouping and rule attributes, respectively. The upper
bound for the size of rules in R is $u_{g,m}$, i.e. $\max_{r \in R}(size(r)) \le u_{g,m}$. □

Attribute Selection. Having evaluated $d_{a,b}$, the user selects those pairs considered most
promising, based on the quality criteria introduced in Section 3. Pairs of attributes
are ranked by decreasing values of $d_{a,b}$: those with low values of $d_{a,b}$
are preferable, provided that this value is not too small. In our experience on
practical cases, we observed that the range of values for $d_{a,b}$ that gives the best
results in terms of quality of the extracted rule set is between 0.01 and 0.1.

Significant help can be obtained from $a_{a,b}$ and $f_{a,b}$. As far as the former is
concerned, small groups are preferable (e.g. from 2 to 5), to concentrate on meta-patterns
that possibly extract small sized rules. As far as the latter is concerned,
low values are preferable, because values of $f_{a,b}$ close to 1 denote that almost
all the set of mined values is frequently present in groups.

Moreover, a pair (a, b) should not be selected if there exists a functional
dependency between a and b. In fact, by Theorem 1, the dependency a → b
causes the generation of an empty rule set R. By Theorem 2, the dependency
b → a causes the extraction of rules only if the minimum support is s = 1/G;
thus, rules do not represent regularities (they appear in one single group) and
are useless. Observe that the existence of a functional dependency a → b is
made evident when $a_{a,b} = 1$, as shown by Theorem 4.

Finally, observe that pairs of attributes with $d_{a,b}$ greater than 1 should not
be considered: in fact, the extracted rules would be of relevant size, with a very
high support and consequently very numerous, so that none of the quality criteria
would be met. For such pairs, $a_{a,b}$ and $f_{a,b}$ need not be computed.
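The two hard screening rules stated above (density greater than 1, and a functional dependency revealed by $a_{a,b}$ = 1) can be applied mechanically to the values reported in Figure 2. The sketch below is only an illustration: the function `screen` and its labels are hypothetical, and the softer judgments (density too low, $f_{a,b}$ too close to 1) are deliberately left to the user, so the resulting candidate list is wider than the final selection in Figure 2.

```python
# (grouping attribute, rule attribute) -> (d, a, f); a and f are None where
# Figure 2 does not report them (pairs with d > 1).
figure2 = {
    ("product", "id"): (4.16, None, None),
    ("day", "id"): (2.78, None, None),
    ("customer", "id"): (2.27, None, None),
    ("product", "customer"): (1.83, None, None),
    ("product", "day"): (1.5, None, None),
    ("day", "customer"): (1.22, None, None),
    ("customer", "day"): (0.82, 1.45, 0.15),
    ("day", "product"): (0.67, 2.2, 0.37),
    ("customer", "product"): (0.55, 2.27, 0.36),
    ("product", "month"): (0.33, 1.83, 0.92),
    ("day", "month"): (0.22, 1, 0.50),
    ("customer", "month"): (0.18, 1.09, 0.55),
}

def screen(d, a):
    """Apply only the hard rules; everything else stays a candidate."""
    if d > 1:
        return "rejected: density above 1"
    if a == 1:
        return "rejected: functional dependency (Theorems 1 and 4)"
    return "candidate"

verdicts = {pair: screen(d, a) for pair, (d, a, _f) in figure2.items()}
candidates = [pair for pair, verdict in verdicts.items() if verdict == "candidate"]
```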

Example 3: Consider the table Transactions of Figure 1.a, and the attributes selected
as grouping attributes in Example 2. We consider all the pairs between one
of the selected grouping attributes and the remaining attributes, and compute
$d_{a,b}$ for each of them. We obtain the list reported in Figure 2.

We obtain a relevant number (6) of pairs that are discarded because the
value of $d_{a,b}$ is greater than 1. On the opposite side, three pairs are not selected
because the value is too low. This is not surprising: in the case of the pair
(product, month) almost all the products appear in both of the considered
months; this situation is confirmed by the very high value (0.92) of $f_{a,b}$. The
same holds for the pair (customer, month), for which we also notice a very small
value of $a_{a,b}$ (1.09) and a significant value of $f_{a,b}$ (greater than 0.5). Finally, the
pair (day, month) does not give rise to any association, since each day belongs
to one single month (a functional dependency, as pointed out by $a_{a,b} = 1$).

To conclude, observe that in this example the values of $d_{a,b}$ for the three
selected attribute pairs range from 0.55 to 0.82. This seems to be in contrast
with our previous assertion that good values are from 1% to 10%. However,
this is a toy example, with a very small number of rows. The range 1% to 10%
proved to be adequate for real large data sets. Furthermore, the example also
shows the importance of $a_{a,b}$ and $f_{a,b}$ to better comprehend the situation. □

5 Minimum Support and Confidence 

Once a set of pairs (g, m) has been selected, the third phase of the exploration
method completes the meta-patterns by identifying suitable minimum thresholds
for support and confidence.

Here, the user is asked to indicate a measure of the number of rules he/she 
considers interesting. Then, by means of a heuristic technique that performs sim- 
ple analysis of the data, minimum threshold values that approximately respect 
the interest measure are identified. 

Number of Distinct Rule Attribute Values. The user defines the number $n_v$
of distinct values for the rule attribute m associated by rules in the set R.

Coverage of Rule Attribute Values. The user defines the coverage ratio $c_c$,
i.e. the percentage of distinct values of the rule attribute m appearing in the
extracted rules w.r.t. the entire set $V_m$.

We now introduce the heuristic technique for the minimum support. 

Pre-analysis of the data. The goal of this step is the extraction from the data set
of the subset of values $\bar{V}_m \subseteq V_m$, with $|\bar{V}_m| = n_v$, having the highest support
values. The values in $\bar{V}_m$ are ordered w.r.t. their frequency: denoting a value as $v_i$,
with $1 \le i \le |\bar{V}_m|$, and its frequency as $pr(v_i)$, it holds that $pr(v_i) \ge pr(v_{i+1})$, for
each $1 \le i < |\bar{V}_m|$.

$\bar{V}_m$ contains the values with the highest support because, in the absence of any
kind of knowledge about the real data distribution, these are the values with the
highest probability of composing rules with the highest supports.

Evaluation of Minimum Support and Confidence. Having computed the set
$\bar{V}_m = \{v_i \mid 1 \le i \le n_v\}$ of the most frequent values of m, consider the set of pairs
$K = \{k_j = (v_{j,1}, v_{j,2}) \mid 1 \le j \le l,\ v_{j,1} = v_j,\ v_{j,2} = v_{j+l}\}$, where l is the ceiling of the
quotient $n_v/2$. For each pair $k_j \in K$, compute the actual frequency $pr(k_j)$ with
which the pair appears in groups. At this point, choose the pair $\bar{k}$ which is immediately
over the third quartile, and take $pr(\bar{k})$. The value $s = pr(\bar{k})$ is then
used in the meta-pattern as minimum support.

Let us determine the minimum confidence: for each pair of values $k_j =
(v_{j,1}, v_{j,2})$ having sufficient support, compute the two conditional probabilities
$pr(v_{j,1} \mid v_{j,2})$ and $pr(v_{j,2} \mid v_{j,1})$, and compute the average $\overline{pr}$ of all the computed
conditional probabilities. The value $c = \overline{pr}$ is then used as minimum confidence.



Footnote: Observe that $n_v = |V_m| \times c_c / 100$.
Example 4: Consider the pair (customer, product) selected in Example 3. We
now identify the minimum support and confidence values. The three tables below
report the result of the analysis performed on the source data set. In particular,
the left table reports the support of each product, listed in decreasing support
order. We then choose to consider all the products ($n_v = 6$), as a consequence
of the fact that the number of distinct products is very small; the middle table
reports the support of the three considered pairs of products, listed in decreasing
support order. Finally, the right table shows the conditional probabilities of each
pair of products with support greater than zero. We establish a minimum support
s = 0.18 and a minimum confidence c = 0.59, getting the rule set in Ex. 1.

prod.  gr.  supp.      pair    gr.  supp.      cond. prob.
A      6    0.55       (A,C)   2    0.18       C|A   0.33
B      6    0.55       (E,D)   2    0.18       A|C   0.66
E      5    0.45       (B,F)   0    0          D|E   0.4
C      3    0.28                               E|D   1
F      3    0.28
D      2    0.18                                          □
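The heuristic can be sketched in a few lines. Everything below is an illustration under stated assumptions, not the authors' code: the groups are hypothetical and were built only to match the counts reported in Example 4 (the actual table of Figure 1.a is not used); `suggest_thresholds` is a made-up name; ties in frequency are broken alphabetically; and "the pair immediately over the third quartile" is read as the element at position ⌊0.75 · |K|⌋ of the pair frequencies sorted in decreasing order, which is one possible interpretation.

```python
import math
from collections import Counter

# Hypothetical customer -> products groups matching the counts of Example 4.
groups = {
    "c1": {"A", "C"}, "c2": {"A", "C"}, "c3": {"A", "B"}, "c4": {"A", "B"},
    "c5": {"A", "B", "E"}, "c6": {"A", "B", "E"}, "c7": {"B", "E", "D"},
    "c8": {"B", "E", "D"}, "c9": {"E", "F"}, "c10": {"C", "F"}, "c11": {"F"},
}

def suggest_thresholds(groups, n_v):
    total = len(groups)
    counts = Counter(v for items in groups.values() for v in items)
    # n_v most frequent values, by decreasing frequency (ties broken by name).
    top = sorted(counts, key=lambda v: (-counts[v], v))[:n_v]
    half = math.ceil(n_v / 2)
    # Sample pairs k_j = (v_j, v_{j+half}).
    pairs = [(top[j], top[j + half]) for j in range(half) if j + half < len(top)]
    joint = {p: sum(1 for items in groups.values() if set(p) <= items) for p in pairs}
    if not joint:
        return pairs, 0.0, 0.0
    freqs = sorted((n / total for n in joint.values()), reverse=True)
    # Assumed reading of "immediately over the third quartile".
    s = freqs[max(math.floor(0.75 * len(freqs)) - 1, 0)]
    probs = []
    for (x, y), n in joint.items():
        if n > 0 and n / total >= s:
            probs += [n / counts[x], n / counts[y]]  # pr(y|x) and pr(x|y)
    c = sum(probs) / len(probs) if probs else 0.0
    return pairs, s, c

pairs, s, c = suggest_thresholds(groups, n_v=6)
```

On these groups the sample pairs are (A,C), (B,F) and (E,D), and the suggested support is 2/11 ≈ 0.18, as in the example. The exact average of the four conditional probabilities is 0.6; averaging the two-digit values printed in the table gives about 0.5975, which is consistent with the c = 0.59 reported above.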

Discussion 1: It is known from statistics that, given the frequencies (probabilities)
$pr(e_1)$ and $pr(e_2)$ of two events $e_1$ and $e_2$, if they are independent the joint
frequency is $pr(e_1) \times pr(e_2)$. However, we are analyzing data that in general
do not meet the independence hypothesis; consequently, the joint frequency is
$pr(e_1) \times pr(e_2 \mid e_1)$, where $pr(e_2 \mid e_1)$ is the conditional probability of $e_2$ given $e_1$.
Hence, we have to compute the actual joint probability of each pair of values.

In order to reduce the complexity of this computation, we limit the search
to a sample composed of $n_v/2$ pairs. The sample is chosen in such a way that it is
composed of a limited number of heterogeneous pairs that involve all the
considered values. Having obtained the frequencies of the sample pairs, we take the
frequency of the pair immediately over the third quartile. This is a consequence of
the fact that the support of some pairs in the sample may be too low; the third
quartile prevents the suggested minimum support from being affected by such pairs. □
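As a worked check, the figures of Example 4 show how far the independence estimate can be from the actual joint frequency. The computation below uses only the group counts reported there (11 groups; A in 6, C in 3, the pair (A,C) in 2), written as exact fractions:

```latex
\[
pr(A)\,pr(C) = \frac{6}{11}\cdot\frac{3}{11} = \frac{18}{121} \approx 0.15,
\qquad
pr(A)\,pr(C \mid A) = \frac{6}{11}\cdot\frac{2}{6} = \frac{2}{11} \approx 0.18 .
\]
```

For the pair (B,F) the gap is larger: B and F have the same individual supports as A and C, so the independence estimate is again about 0.15, while the actual joint frequency reported in the example is 0.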

6 Conclusions and Future Work 

In this paper, we addressed the problem of defining meta-patterns for the ex- 
traction of association rules. A meta-pattern defines the features of data mining 
problems based on the known techniques for extracting association rules. These 
features are the source data set, the grouping attribute, the mined attribute, the 
minimum thresholds for support and confidence. 

This work is motivated by the observation that the definition of a meta- 
pattern that produces a quality association rule set, i.e. a rule set which is 
comprehensible, meaningful and informative for the user, is not a trivial task, 
because the data to analyze are totally or partially unknown. 

We propose a heuristic exploration technique that, by means of simple
queries over the data, first guides the user in the choice of the grouping and
mined attributes and then, based on a suitable interest measure provided by the
user, suggests the minimum thresholds for support and confidence.

Experiences. We tested the proposed exploration method in practical cases.
In particular, a pool of users engaged in the analysis of data describing the enrollment
of students in the courses of a university adopted the proposed method. They were
not familiar with data mining techniques, and at the beginning of their work
they found it difficult to understand what kind of information to extract from the
database and how to formulate meta-patterns.

By means of the exploration method proposed in this paper, they were able
to quickly address the problem, focusing their attention on a limited number of
promising meta-patterns. Finally, with limited effort, they were able to identify
suitable values for minimum support and confidence, obtaining rule sets with
the desired synthesis level. The users considered the generated rules very informative,
and were satisfied with the speed with which they obtained such results.

Future work. In our opinion, there is a significant amount of work to do in the
direction addressed by this paper. In fact, there is a need for a methodological
framework that covers the different activities concerning the data mining phase
of the knowledge discovery process: these activities range from the definition of
mining meta-patterns to the evaluation of extracted rule sets.

As far as the definition of meta-patterns is concerned, we are now studying a
method that exploits the typical star schema of data warehouses to discover
complex meta-patterns based on the mining operator introduced in [8], which is
defined on a more complex semantic model than that considered in this paper,
and takes advantage of the structure of the data warehouse.
Abstract. This paper proposes the fuzzy association rule, which is a more
generalized concept than boolean, quantitative, and interval association
rules. The fuzzy association rule is a spectrum of definitions. Each particular
fuzzy association rule can be defined by adding restrictions on the
fuzziness, depending on the needs of practical situations. The definition
of the fuzzy association rule also fills the gap between fuzzy functional
dependencies and clusters, and results in a whole spectrum of concepts
which is called the data association spectrum. Such a unified view has practical
implications. For example, various data mining problems can be
converted to clustering problems and take advantage of the availability
of a large number of good clustering algorithms.



Key Words: Fuzzy association rule, fuzzy functional dependency, measurement. 



1 Introduction 

One of the main subjects in data mining research is association rule mining
[1,3]. Let $R = \{a_1, a_2, \ldots, a_m\}$ be a relation schema, which is a set of attributes
(column titles) defining the format of relations. Let r be an instance of R, i.e.,
r is a relation with the attributes of R. This means that r is composed of many
tuples; each tuple has exactly m entries and each entry is under one of the m
attributes $a_i$, i = 1, 2, ..., m. An association rule [1,2] is an implication written
as a ⇒ b, where a is a value or vector in an attribute set X (⊂ R) and b is a
value or vector in an attribute set Y (⊂ R). An association rule has support and
confidence. The support of the rule a ⇒ b is s if s tuples in r contain both a and b
in attribute sets X and Y. The confidence of the rule is c (c ≤ 1, a percentage) if
c of all tuples that contain a also contain b. This is called a boolean association
rule problem [8]. Usually, minimal thresholds of support and confidence are
given before the mining operation starts, and only those rules whose supports
and confidences are higher than the minimal thresholds are selected as useful
rules.

Mukesh Mohania and A Min Tjoa (Eds.): DaWaK’99, LNCS 1676, pp. 229-240, 1999 
© Springer-Verlag Berlin Heidelberg 1999 
A more general form of association rule, the quantitative association rule [8],
or interval association rule [9], is defined as an implication A ⇒ B, where A is a vector
of intervals, each of which is in one of the attributes in X. B is defined similarly
with respect to Y. The support of the rule is s if s tuples fall in intervals of A
when projected to X and also fall in intervals of B when projected to Y. The
confidence of the rule is c (c ≤ 1, a percentage) if c of all the tuples that fall in
intervals in A also fall in intervals in B.

1.1 A Motivating Example 

A large corporation schedules and implements hundreds of projects each year to
improve its productivity and the quality of its products. Association rule mining
is performed to help company management better understand how the company’s
business is benefiting from these projects.

To keep discussion simple, assume the relational table to be mined has only 
four attributes: budget, time, productivity, and quality. Each row in the table 
represents a project. The budget entry of the row contains the amount of funds 
spent on the project. The time entry contains the length of time to complete 
the project. The productivity entry contains measurable relative productivity 
improvement in some parts of company’s business due to the completion of the 
project. The quality entry contains measurable relative quality improvements in 
some of the company’s products and services. 

For each row in the table, budget and time entries represent a point (bud- 
get, time) in a two dimensional space and productivity and quality entries also 
represent a point (productivity,quality) in another two dimensional space. 

A mined quantitative association has the form A ⇒ B, where A is a set
of points in the budget-time space and B is a set of points in the productivity-quality
space. Under the definition of quantitative association rule, A and B have
to be in rectangles [8]. However, this example clearly shows that the restriction
to rectangles is not necessary.

1.2 Contributions of This Paper 

The initial motivation of this paper is that the “cubic” shape restriction on
the antecedent and consequent of the quantitative association rule should be
removed. This is necessary because, in the real world, the actual shape of the
antecedent and consequent may not be “cubic”. By removing this restriction,
a spectrum of association rules is defined in this paper. The most
general form, i.e., the generic form of the association rule definition, is free
of any “shape” restriction, and this type of rule is shown to be useful by the
above example. The shape concept is introduced into the association rule only
after some kind of measurement is used. Different measurements can be used to
define different forms of (fuzzy) association rules, from the most generic rule to
the simplest rule, i.e., the boolean association rule.

This paper also explores the similarity between the boolean association rule and
the classical functional dependency. For a relation schema R and an instance r
(a dataset) of R, let attribute sets X ⊂ R and Y ⊂ R. If Y depends on X in the
sense of classical functional dependency, written as X → Y, then this classical
functional dependency may contain a set of boolean association rules. It is easy
to compute these rules: if a value a appears in the attribute set X of a sufficient
number (above a threshold) of tuples, and b is the value in Y that corresponds to a
(on the same tuple), then a ⇒ b is a boolean association rule. The support
of this rule is sufficient and the confidence of this rule is 100%.
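The computation described above can be sketched as follows. This is an illustrative sketch, not code from the paper: the function name `rules_from_fd`, the attribute names and the sample tuples are hypothetical, and X and Y are each reduced to a single attribute for brevity.

```python
from collections import Counter

def rules_from_fd(tuples, x, y, threshold):
    """Derive boolean rules a => b from tuples in which x -> y is assumed to hold.

    Returns (a, b, support, confidence) for every value a of x that appears
    in more than `threshold` tuples.
    """
    mapping = {}
    support = Counter()
    for t in tuples:
        a, b = t[x], t[y]
        if mapping.setdefault(a, b) != b:
            raise ValueError(f"{x} -> {y} is violated for value {a!r}")
        support[a] += 1
    # Confidence is 1.0: every tuple containing a also contains b.
    return [(a, mapping[a], n, 1.0) for a, n in support.items() if n > threshold]

# Hypothetical relation in which dept -> manager holds.
tuples = [
    {"dept": "d1", "manager": "m1"},
    {"dept": "d1", "manager": "m1"},
    {"dept": "d1", "manager": "m1"},
    {"dept": "d2", "manager": "m2"},
]
rules = rules_from_fd(tuples, "dept", "manager", threshold=2)
```

Only d1 ⇒ m1 is returned here, because d2 appears in a single tuple and so lacks sufficient support.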

Still another interesting discovery is that this relationship between functional
dependency and association rule exists, in parallel, between fuzzy functional
dependency and fuzzy association rules.

Graphically, a boolean association rule can be represented by two points in 
a two dimensional space (details will be discussed later in this paper). This 
graphical representation can be extended to quantitative and the more general 
fuzzy association rules by using clusters of points on a two dimensional space. 



[Figure 1 (diagram): Different measurements determine different types of fuzziness.
Dependencies: fuzzy functional dependency → functional dependency.
Rules: fuzzy association rule → boolean association rule.
Graphical representations: cluster → point.]

Fig. 1. Sketch of Relationships Discussed in the Paper



As sketched in Fig. 1, a number of relationships are discussed in this paper.
These concepts and their relationships form a symmetric grid, which we call the
data association spectrum. The details of this spectrum will be discussed in the
following sections.

The significance of such a spectrum is that it provides some insight into the
relationship among these concepts. For example, it is known that association
rules can be mined by using a scanning algorithm such as the Apriori algorithm,
and it is also known that some spatial clustering algorithms can be used to find
clusters. From this spectrum, it is clear that fuzzy association rules can be mined
by using some spatial clustering algorithms. So, one practical implication of the
establishment of the data association spectrum is that it can help to solve some
practical mining problems, such as finding appropriate mining algorithms for
particular rule mining problems.

Compared to other extensions such as H-rule [10] which extends the associa- 
tion rule concept to data associations across multiple relations, the focus of the 
current paper is that it extends the set-to-set (which includes the point-to-point 
boolean association rule [1] as a special case) association rule within a single 




232 Y. Yang and M. Singhal 



relational (or transactional) dataset to the most generalized form: the generic 
form, as defined in Section 2. 

Section 2 defines the generic form of the fuzzy association rule. Section 3 
discusses some measurements that will be used in further specifying fuzzy asso- 
ciation rules. Section 4 defines some more specific fuzzy association rules using 
measurements. Section 5 explores the relationship between fuzzy functional de- 
pendencies and fuzzy association rules. Section 6 explores the relationship be- 
tween fuzzy association rules and clusters. Section 7 concludes the paper with 
some insight to the nature of the association rule mining problem. 



2 Fuzzy Association Rule (Generic Form) 

2.1 A Geometric View of the Association Rule 

Supermarket basket data are frequently used in discussions of association
rule mining. A basket dataset is composed of a large number of shopping records.
Assume the support threshold is set to 800 for a boolean association rule mining
problem, and the confidence threshold is set to 50%. Suppose a total
of 1,200 customers bought cake mix and a total of 1,000 customers bought both
cake mix and cake icing. Then (cake mix) ⇒ (cake icing) is a boolean association
rule with support 1,000 and confidence 1,000/1,200 = 83.3%. This rule has a
geometric interpretation in an XY space. For X, 1 is used to denote a shopping
record containing cake mix. For Y, 1 is used to denote “buy cake icing” and 0
to denote “not buy cake icing”. This rule can be viewed as involving two points,
(1,1) and (1,0). The support 1,000, interpreted geometrically, means that there
are 1,000 overlapped dots at (1,1) in XY space. The confidence 83.3%, interpreted
geometrically, means that 83.3% of all dots whose projection on subspace X is
1 are at (1,1).
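The geometric reading of this example can be checked with a few lines of code. The sketch below is illustrative: the dot counts for x = 1 come from the text (1,000 records with both items, hence 200 with cake mix only), while the 5,000 records without cake mix are a hypothetical filler that does not affect the result.

```python
from collections import Counter

# Each shopping record is a dot (x, y): x = 1 if it contains cake mix,
# y = 1 if it contains cake icing. The (0, 0) count is a hypothetical filler.
dots = Counter({(1, 1): 1000, (1, 0): 200, (0, 0): 5000})

support = dots[(1, 1)]                                   # overlapped dots at (1, 1)
on_x1 = sum(n for (x, _y), n in dots.items() if x == 1)  # dots projecting to X = 1
confidence = support / on_x1

# The rule holds under the thresholds stated in the text (800 and 50%).
holds = support >= 800 and confidence >= 0.5
```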

More generally, a data set r (such as the basket dataset) can be viewed as
a set of dots in an m-dimensional space $R = \{X_1, X_2, \ldots, X_m\}$. In the case of
basket data, m is the number of different types of goods in the supermarket. An
attribute, or a set of attributes, can be viewed as a subspace of R. Let A ⇒ B
be an association rule, where A and B are individual values or vectors in some
subspaces of R. Also suppose A is in subspace X (⊂ R; X represents a subset of
goods in the market) and B is in subspace Y (⊂ R). Conceptually, there are two
sets of axes, one for X and one for Y. A boolean association rule only involves two
points in space R. A quantitative association rule [8] or an interval association
rule [9] involves high dimensional cubes, one on the axes of X with dimension
|X| (the number of attributes in X) and the other on the axes of Y with
dimension |Y|.

With this geometric interpretation, one can see that in association rules such 
as the quantitative association rule, the “cubic” shape of the antecedent and the 
consequent is subjective. This observation motivates a more generalized defini- 
tion of association rule. 
2.2 The Generic Form of Association Rules

Support and confidence are two basic conditions in the definition of association
rules. These two conditions should exist in the more generalized association rule
definition. In the following discussion, $P_i[X]$ denotes the projection of $P_i$, which
is in space R, into subspace X.

Definition 1: Fuzzy Association Rule (Generic Form)

R is a relation schema with m attributes. X ⊂ R, Y ⊂ R. $f_x$ is a strength
function on X and $f_y$ is a strength function on Y. $\alpha_x$ and $\alpha_y$ are support
thresholds. $\beta$ is a confidence threshold. Also, let $P_1$ and $P_2$ be two sets
of (m-dimensional) points in R, with $P_1[X] \subseteq P_2[X]$.

A fuzzy association rule $P_2[X] \Rightarrow P_1[Y]$ holds if the following two conditions
are satisfied:

1. (support condition) $f_x(P_1[X]) > \alpha_x$, $f_y(P_1[Y]) > \alpha_y$.

2. (confidence condition) $f_x(P_1[X]) / f_y(P_1[Y]) > \beta$.

This generic form of association rule is free of any “shape” restrictions on 
the antecedent or consequent of the rule. If the antecedent and consequent are 
restricted to cubes, this generic form of association rule can be reduced to the 
quantitative association rule [8] and interval association rule [9]. In practice, 
the kind of restrictions applied to this generic form can be tailored to the needs 
of particular applications. 

The restriction can be achieved by providing a measurement (distance) in the 
m-dimensional space R. Intuitively, by varying the type of measurement used, one 
can “carve” different shapes of the antecedent and consequent. For example, in 
a two dimensional space, a distance $D_1(p_1, p_2) = |x_1 - x_2| + |y_1 - y_2|$ defined 
between two points $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ carves out a rectangular 
shaped fuzzy association rule. A distance $D_2(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ 
carves out a circular shaped fuzzy association rule. 
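
To make the effect of the measurement concrete, the following minimal Python sketch evaluates both distances on a small integer grid and collects the points that fall within a fixed radius of the origin. The helper `neighborhood`, the grid, and the radius 2.5 are illustrative assumptions, not part of the paper.

```python
import math

def d1(p, q):
    """Sum of absolute coordinate differences (D_1 in the text)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def d2(p, q):
    """Euclidean distance (D_2 in the text)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def neighborhood(points, center, dist, radius):
    """Hypothetical helper: points whose distance to `center` is below `radius`."""
    return [p for p in points if dist(p, center) < radius]

# A 5 x 5 integer grid centered on the origin (illustrative data).
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
shape_d1 = neighborhood(grid, (0, 0), d1, 2.5)
shape_d2 = neighborhood(grid, (0, 0), d2, 2.5)
```

The two neighborhoods contain different subsets of the grid, which is the sense in which the choice of measurement determines the shape of the antecedent and consequent.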

As this paper focuses on establishing the conceptual framework for a spectrum 
of fuzzy association rules, particular shapes of fuzzy association rule and their 
usefulness in practical applications are not elaborated here. One of the shapes, 
the rectangular shape, corresponds to the quantitative association rule [8] and is 
recognized as a useful type of rule. 

3 Combining Measurements 

The combination of measurements is discussed first, before the individual 
measurements and the fuzzy association rules defined with them. A combining 
operation defines a measurement in the high dimensional space by combining 
measurements in the lower dimensional spaces. A well defined combining 
operation is needed because the number of dimensions should not affect the 
definition of a fuzzy rule. In other words, a fuzzy rule defined using a 
particular type of measurement in a low dimensional space should use the same 
type of measurement, after combination, in the high dimensional space. The 
following definition is adopted and modified from the fuzzy EQ relation in [4]: 

Definition 2: Measurement 

Let R be a relation schema, r be a relation over R, and $X \subset R$. Then, 
D is a measurement in X if 

1. $D(a, a) = 0$, $\forall a \in X$. a could be $\top$ (undefined). 

2. $D(a, b) > 0$, $\forall a, b \in X$, $a \ne b$. 

3. $D(a, b) = D(b, a)$, $\forall a, b \in X$. 

The second condition is needed to ensure that decompositions over fuzzy 
functional dependency have the lossless join property, as proved in [4]. 

There are many ways to combine measurements defined in lower dimensional 
subspaces into a measurement in a higher dimensional space. We define the most 
general form of the combining operation as: 

Definition 3: Combination Measurement (Generic Form) 

Let $R(A_1, A_2, \ldots, A_n)$ be a relation schema, r be a relation over R, and 
tuples $t_1, t_2 \in r$. Let measurements $D_1, D_2, \ldots, D_k$ be defined on mutually 
non-intersecting subschemas (subspaces) of R: $S_1, S_2, \ldots, S_k$, respectively. 
Then, the combination measurement D is defined as 

$D(t_1, t_2) = f(D_1(t_1[S_1], t_2[S_1]), \ldots, D_k(t_1[S_k], t_2[S_k]))$    (1) 

where f is a function of k nonnegative variables and satisfies the following 
conditions: 

1. $f = 0$, if all the variables are equal to zero. 

2. $f > 0$, if at least one of the variables is nonzero. 

3. $f(a_1, a_2, \ldots, a_k) \ge f(0, \ldots, 0, a_i, 0, \ldots, 0)$, $i = 1, 2, \ldots, k$. 

Depending upon how f is defined, there can be many different variations of 
the combining operation for measurements. For example, if f is a function using 
maximum, i.e., $f = \max(\cdot, \cdot, \ldots, \cdot)$, then the combining operation is the same 
as in [4-6]. We define two particular combining operations as presented below: 

Definition 4: Combination Measurement (MAX) 

Let $R(A_1, A_2, \ldots, A_n)$ be a relation schema. Let measurements $D_1, D_2, \ldots, D_k$ 
be defined on mutually non-intersecting subschemas (subspaces) of R: 
$S_1, S_2, \ldots, S_k$, respectively. Then, a combination measurement D can be 
defined as 

$D(t_1, t_2) = \max\{D_1(t_1[S_1], t_2[S_1]), \ldots, D_k(t_1[S_k], t_2[S_k])\}$    (2) 

Definition 5: Combination Measurement (MUL) 

Let $R(A_1, A_2, \ldots, A_n)$ be a relation schema. Let measurements $D_1, D_2, \ldots, D_k$ 
be defined on mutually non-intersecting subschemas (subspaces) of R: 
$S_1, S_2, \ldots, S_k$, respectively. Then, a combination measurement D can be 
defined as 

$D(t_1, t_2) = \sqrt{D_1(t_1[S_1], t_2[S_1])^2 + \ldots + D_k(t_1[S_k], t_2[S_k])^2}$    (3) 
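
As an illustration of Definitions 3 to 5, the sketch below combines per-subspace measurements with either the maximum or the root of the sum of squares. The function names, the representation of a subschema as a tuple of attribute positions, and the sample tuples are assumptions made for this example only.

```python
import math

def combine_max(values):
    """Combining function f of the MAX form."""
    return max(values)

def combine_root_sum_squares(values):
    """Combining function f of the root-of-sum-of-squares form."""
    return math.sqrt(sum(v * v for v in values))

def combined_measurement(t1, t2, parts, combine):
    """Apply each subspace measurement to the projected tuples, then combine.

    `parts` is a list of (attribute_positions, measurement) pairs over
    mutually non-intersecting subschemas (a hypothetical representation).
    """
    per_part = [m([t1[i] for i in idx], [t2[i] for i in idx]) for idx, m in parts]
    return combine(per_part)

def abs_diff(a, b):
    """One-dimensional measurement on a single attribute."""
    return abs(a[0] - b[0])

parts = [((0,), abs_diff), ((1,), abs_diff)]
t1, t2 = (0.0, 0.0), (3.0, 4.0)
d_max = combined_measurement(t1, t2, parts, combine_max)
d_rss = combined_measurement(t1, t2, parts, combine_root_sum_squares)
```

Both combining functions return zero only when every component distance is zero and never fall below any single component, which is what the three conditions on f require.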
The following definition is also adopted from [4]. It is nothing more than a 
formal and short way of saying that the tuple is meaningful in subspace X. This 
definition is needed for the convenience of subsequent rule definitions. 

Definition 6: X-defined 

Let $R = (A_1, A_2, \ldots, A_n)$, and each $A_i$ has data domain $dom(A_i)$, 
$i = 1, 2, \ldots, n$. Also, $X \subset R$. A tuple $t = (a_1, a_2, \ldots, a_n)$ is X-defined if, for 
each attribute $A_i \in X$, $t[A_i] \ne \top$ (undefined). 

Also, for convenience of reference, we call a volume which contains a data 
point p a neighborhood of p and denote it by S(p). This notion is needed in 
the subsequent definition for defining a cluster (a dense set of points) centered 
around a central point. 

A particular neighborhood can be defined by using a particular form of 
measurement. Let R be a relation schema, subschema (subspace) $X \subset R$, r be an 
instance (a dataset) of R, a tuple $t_0 \in r$, and D be a measurement defined on 
subspace X. Let $c > 0$ be a constant; then $t_0[X]$ can have a neighborhood defined 
as 

$\{t[X] \mid D(t[X], t_0[X]) < c\}$    (4) 

4 Some Specific Fuzzy Association Rules 

This section discusses several measurement definitions and the resulting 
measurement based definitions of fuzzy association rules. Each measurement based 
definition of fuzzy association rule is a particular case of the generic form of 
fuzzy association rule defined in Section 2. 

Definition 7: Fuzzy Association Rule (Neighborhood) 

Let r be a relation over schema R, and subschemas (subspaces) $X, Y \subset R$. 
Given an XY-defined tuple $t_0 \in r$, a neighborhood $N^X(t_0[X])$ of $t_0[X]$ 
in subspace X and a neighborhood $N^Y(t_0[Y])$ of $t_0[Y]$ in subspace Y, a 
density function f defined on both X and Y, $\delta_X > 0$, $\delta_Y > 0$, $\alpha > 0$, 
$\beta > 0$. A fuzzy association rule $t_0[X] \Rightarrow t_0[Y]$ holds in r if all of the 
following four conditions are satisfied: 

1. (cluster condition in X) $f(S_1[X]) \ge \delta_X$, where 
$S_1 = \{t \mid t[X] \in N^X(t_0[X])\}$. 

2. (cluster condition in Y) $f(S_2[Y]) \ge \delta_Y$, where 
$S_2 = \{t \mid t[Y] \in N^Y(t_0[Y])\}$. 

3. (support condition) 
$n_1 = |\{t \mid t \text{ is XY-defined}, t[X] \in N^X(t_0[X]), t[Y] \in N^Y(t_0[Y])\}| \ge \alpha$. 

4. (confidence condition) $n_1 / n_2 \ge \beta$, where 
$n_2 = |\{t \mid t \text{ is X-defined}, t[X] \in N^X(t_0[X])\}|$. 
Compared to Definition 1, the first two are extra conditions intended to 
make the antecedent and consequent “dense” enough so that the rule is useful 
in practical situations. The last two conditions are just the same support and 
confidence conditions, defined more specifically in terms of neighborhoods. 
The set $S_1$ is called the rule base, $N^X(t_0[X])$ is the body of the rule base, the set 
$S_2$ is the rule target, $N^Y(t_0[Y])$ is the body of the rule target, 
$\{t \mid t \text{ is XY-defined}, t[X] \in N^X(t_0[X]), t[Y] \in N^Y(t_0[Y])\}$ is the rule support set, 
and the combined neighborhood $N^X(t_0[X]) \times N^Y(t_0[Y])$ in subspace XY is the body 
of the rule support set. The body concept is the generalization of the high 
dimensional cube concept used in the quantitative association rule [8]. 

If the shape of a neighborhood is a cube in a high dimensional subspace, the 
restricted rule is exactly the quantitative association rule [8] or interval associ- 
ation rule [9]. Also, a single rule definition may use more than one measurement 
as the next example shows. 

Definition 8: Fuzzy Association Rule (Measurement) 

Let r be a relation over schema R and subschemas (subspaces) $X, Y \subset R$. 
Let $D_1$ be a measurement on X and $D_2$ be a measurement on Y. $c > 0$, 
$\alpha > 0$, $\beta > 0$, $\delta > 0$ are given. Let tuple $t_0$ be XY-defined. A fuzzy 
association rule $t_0[X] \Rightarrow t_0[Y]$ holds in r if: 

- (support condition) $s \ge \alpha$, where 
$s = |\{t \mid t \text{ is XY-defined and } D_1(t[X], t_0[X]) \ge c \cdot D_2(t[Y], t_0[Y]), D_1(t[X], t_0[X]) < \delta\}|$. 

- (confidence condition) $s / b \ge \beta$, where s is defined above, and 
$b = |\{t \mid D_1(t[X], t_0[X]) < \delta\}|$. 

The support and the confidence conditions can be changed by varying $\alpha$ 
(support threshold) and $\beta$ (confidence threshold), and also by varying $\delta$ 
(rule base quality) and c (strength of the implication). 
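
The two conditions of Definition 8 can be checked directly by counting. The following is a minimal sketch, assuming each tuple is an (x, y) pair of numbers with `None` marking an undefined part, and assuming the thresholds are inclusive; the function names and the sample data are hypothetical.

```python
def rule_support(tuples, t0, d1, d2, c, delta):
    """Return (s, b) from Definition 8 for the anchor tuple t0."""
    x0, y0 = t0
    # Rule base: tuples whose X part lies within delta of t0[X].
    base = [t for t in tuples if t[0] is not None and d1(t[0], x0) < delta]
    # Support set: base tuples that are also Y-defined and satisfy the inequality.
    support = [t for t in base
               if t[1] is not None and d1(t[0], x0) >= c * d2(t[1], y0)]
    return len(support), len(base)

def rule_holds(tuples, t0, d1, d2, c, delta, alpha, beta):
    """Check the support and confidence conditions for anchor t0."""
    s, b = rule_support(tuples, t0, d1, d2, c, delta)
    return s >= alpha and b > 0 and s / b >= beta

def abs_diff(a, b):
    return abs(a - b)

data = [(10, 100), (11, 100.5), (12, 101), (11, 105), (30, 100)]
s, b = rule_support(data, (10, 100), abs_diff, abs_diff, c=1.0, delta=3.0)
holds = rule_holds(data, (10, 100), abs_diff, abs_diff,
                   c=1.0, delta=3.0, alpha=3, beta=0.7)
```

Raising c makes the inequality harder to satisfy and shrinks the support set, while raising the radius enlarges the rule base b.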

5 Fuzzy Dependencies and Fuzzy Rules 

Definition 8 for the fuzzy association rule is very similar to that of the fuzzy 
functional dependency defined in [4-6], except that the inequality may not hold 
for all data points in XY. 

In Definition 8, if the minimum support condition is dropped and the 
inequality holds for all data points in XY, the result is an exact definition of fuzzy 
functional dependency: 

Definition 9: Fuzzy Functional Dependency (Generic Form) 

Let r be a relation over R. $X, Y \subset R$. A fuzzy functional dependency 
(ffd) $X \rightarrow Y$ holds in r if for all tuples $t_1, t_2 \in r$, $t_1[X] \in N^X(t_2[X])$ 
or $t_2[X] \in N^X(t_1[X])$ implies that either both $t_1[Y]$ and $t_2[Y]$ are 
undefined or there is a subspace $Y'$ of Y, $t_1[Y'] \in N^{Y'}(t_2[Y'])$ or 
$t_2[Y'] \in N^{Y'}(t_1[Y'])$. 
Definition 9 is more general than previously defined fuzzy functional 
dependencies in [4-6] because the more general concept of neighborhood is used. 

Definition 10: Fuzzy Functional Dependency (Measurement) 

Let r be a relation over schema R, and $X, Y \subset R$. $D_1$ is a measurement 
on X and $D_2$ is a measurement on Y. $c > 0$ is a constant. A fuzzy 
functional dependency (ffd) $X \rightarrow Y$ holds in r if for all X-defined tuples 
$t_1, t_2 \in r$, 

$D_1(t_1[X], t_2[X]) \ge c \cdot D_2(t_1[Y], t_2[Y])$    (5) 

The above definition includes both cases in the definition of fuzzy functional 
dependency in [4]. To be specific, in [4], for defining fuzzy functional 
dependency, the tuples $t_1$ and $t_2$ must satisfy one of the following conditions: 

1. If there exists a nonempty set $Y' \subset Y$ such that $t_1[Y''] \ne \top$, $t_2[Y''] \ne \top$ for 
each $Y'' \in Y'$, and $t_1[Y - Y'] = t_2[Y - Y'] = \top$, then $t_1$ and $t_2$ should satisfy: 

$D_1(t_1[X], t_2[X]) \ge D_2(t_1[Y'], t_2[Y'])$    (6) 

This case is included in our definition. 

2. If $t_1[Y] = t_2[Y] = \top$. 

In this case, we have $D_2(t_1[Y], t_2[Y]) = 0$ according to our measurement 
definition. So, $D_1(t_1[X], t_2[X]) \ge c \cdot D_2(t_1[Y], t_2[Y])$ will hold. 

Definition 10 not only includes the previous definition of fuzzy functional 
dependency; with the use of the threshold c, it is also more general. 

If the underlying relation is not a fuzzy one, but a classical relation, fuzzy 
functional dependency will reduce to the classical functional dependency. There- 
fore, our definition of the fuzzy functional dependency, being a more general one 
than that in [4] , is indeed a generalization of the classical functional dependency. 

Comparing definitions 8 and 10, we can see that the fuzzy association rule 
and the fuzzy functional dependency are so closely related, that the essential 
difference is just that the fuzzy functional dependency is an assertion over all 
data points of the subspace. 
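
The contrast with the association rule can be seen in a short sketch that checks inequality (5) over every pair of tuples. It also illustrates the reduction to the classical case: with a 0/1 measurement on both sides and c = 1, the check says that equal X values force equal Y values. The function names and the sample relations are assumptions for illustration, and tuples are taken to be fully defined (x, y) pairs.

```python
from itertools import combinations

def ffd_holds(tuples, d1, d2, c):
    """True if d1(x1, x2) >= c * d2(y1, y2) for every pair of (x, y) tuples."""
    return all(d1(t1[0], t2[0]) >= c * d2(t1[1], t2[1])
               for t1, t2 in combinations(tuples, 2))

def abs_diff(a, b):
    return abs(a - b)

def discrete(a, b):
    """0 for equal values, 1 otherwise (the measurement of a classical relation)."""
    return 0 if a == b else 1

smooth = [(0, 0.0), (1, 0.5), (3, 1.0)]
smooth_ok = ffd_holds(smooth, abs_diff, abs_diff, c=1.0)
jumpy_ok = ffd_holds(smooth + [(3.1, 5.0)], abs_diff, abs_diff, c=1.0)

# Classical functional dependency X -> Y as a special case.
classical_ok = ffd_holds([("a", 1), ("a", 1), ("b", 2)], discrete, discrete, c=1)
classical_bad = ffd_holds([("a", 1), ("a", 2)], discrete, discrete, c=1)
```

A single violating pair is enough to make the dependency fail, whereas an association rule only needs enough tuples near its anchor to satisfy the inequality.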



6 Fuzzy Association Rules and Clusters 

We can also define clusters using measurement. The following is one of the cluster 
definitions: 

Definition 11: Cluster (Measurement) 

Let Y be a subspace of R, i.e., $Y \subset R$. Let D be a measurement defined 
on Y; $\alpha > 0$ and $\delta > 0$ are given. A cluster in Y is a set of points 
$p_1, p_2, \ldots, p_k$ such that $D(p_i, p_j) < \delta$, $\forall i, j = 1, 2, \ldots, k$, and $k \ge \alpha$. 
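
Definition 11 translates into a pairwise check. The sketch below assumes an inclusive size threshold and uses the Euclidean distance from the Python standard library as the measurement; the function name and the sample points are hypothetical.

```python
import math
from itertools import combinations

def is_cluster(points, dist, delta, alpha):
    """True if there are at least alpha points and all pairwise distances are below delta."""
    if len(points) < alpha:
        return False
    return all(dist(p, q) < delta for p, q in combinations(points, 2))

tight = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
spread = tight + [(5.0, 5.0)]
tight_ok = is_cluster(tight, math.dist, delta=1.0, alpha=3)
spread_ok = is_cluster(spread, math.dist, delta=1.0, alpha=3)
too_few = is_cluster(tight[:2], math.dist, delta=1.0, alpha=3)
```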
As shown in Fig. 2, a fuzzy association rule can be viewed as a special cluster 
which crosses two different subspaces. Assume r is an instance of schema R, and 
$X, Y \subset R$. Let $D_X$ and $D_Y$ be measurements on X and Y, respectively. Suppose 
Definition 8 is used as the definition of the fuzzy association rule, with $D_X$, $D_Y$ 
as measurements in X and Y, strength of implication $c > 0$, rule base quality 
$\delta > 0$, support threshold $\alpha$, and confidence threshold $\beta$. 




Fig. 2. Rules and Clusters (panels: X, Y, XY) 



Let S be a set of tuples in r satisfying a fuzzy association rule $S[X] \Rightarrow S[Y]$ 
under Definition 8, i.e., $\forall p_i, p_j \in S$, $p_i$ and $p_j$ satisfy: $D_X(p_i[X], p_j[X]) < \delta$, 
$c \cdot D_Y(p_i[Y], p_j[Y]) \le D_X(p_i[X], p_j[X])$, and $|S[XY]| = s \ge \alpha$. If we take 
$\delta_{cluster} = \max\{\delta, \delta/c\}$ and $\alpha_{cluster} = |S[XY]|$, then $S[XY]$ is also 
a cluster under Definition 11 with parameters $\delta_{cluster}$ and $\alpha_{cluster}$. 

This kind of association between fuzzy association rules and clusters is useful 
in computation. Let us return to the motivating example at the beginning of this 
paper. The distribution of the data points in space X = (budget, time) is 
uneven. It is dense in some areas and sparse in others, and a similar situation 
exists for space Y = (productivity, quality). It is computationally efficient to first 
find all clusters in X and Y, respectively. Then, only for the acceptable clusters 
(those that are dense enough and contain enough data points) do we try 
to match them into fuzzy association rules. In this particular example, a fast 
clustering algorithm, with some modification, can be used as a fuzzy association 
rule mining algorithm. 



7 Conclusion and Future Work 

This paper proposed the concept of the fuzzy association rule, which is the 
most generalized form of the set-to-set association rule within a single relational 
dataset (this can be easily translated into transactional dataset and vice versa) , 
explored the essential commonalities as well as differences between functional de- 
pendencies and association rules, and the relationship between fuzzy association 
rules and clusters. 
Fig. 3. Data Association Spectrum 



For very large databases, the concept of fuzzy association rule is very useful 
in two ways. First, the fuzziness allows data miners to concentrate on clusters 
instead of individual data points. Second, the fuzziness sometimes reflects 
real-world complexity more faithfully because, in many cases, the relationships 
between data are fuzzy by nature, and we would lose insight into the nature of the 
problem if we concentrated only on small details. 

The concept of functional dependency can take various forms. By using vari- 
ous definitions of measurements (distance measure) , various forms of functional 
dependencies can be obtained. All these different forms of functional dependency 
constitute a spectrum of concepts, from the most abstract concept of fuzzy func- 
tional dependency to the simplest concept of classical functional dependency 
commonly used in the relational database theory. This concept spectrum may 
be called fuzzy functional dependency spectrum and is represented by the second 
row in Fig. 3. 

Similarly, the association rules can take various forms. By using different 
definitions of measurements, different forms of association rules can be obtained. 
The most abstract concept among them is the generic form of fuzzy association 
rule defined in Section 2 and the simplest one is the boolean association rule [1]. 
The quantitative [8] and the interval association rules [9] are two of the concepts 
somewhere in the middle of this spectrum of concepts. We call this spectrum the 
fuzzy association rule spectrum; it is represented by the third row in Fig. 3. 

The concept of the most abstract form of fuzzy functional dependency is 
closely related to the most abstract form of fuzzy association rule. The various 
forms of fuzzy functional dependencies defined by using various kinds of mea- 
surement are closely related to the fuzzy association rules defined by using the 
same kind of measurement. This parallelism suggests that these two concept 
spectra can be put into a common data association spectrum, as shown in Fig. 3. 

Also, by varying the definition of measurement, various clusters can be 
obtained. A similar spectrum exists in the definition of clusters, from 
the most general one, i.e., the topological one, to the simplest one, i.e., a point. 
We call this the cluster spectrum; it is represented by the fourth row in Fig. 3. 

Together, all three concept spectra constitute a bigger spectrum, which we 
call the data association spectrum. What is achieved here is a deeper 
understanding of the nature of a series of problems such as association rule mining and 
clustering and their relationships. For example, between the concept of fuzzy 
functional dependency and cluster, there is the concept of fuzzy association rule 
(because a functional dependency is a data value association on the whole dataset 
and an association rule is a data value association on part of the dataset). So, if a 
clustering algorithm cannot be used to mine any fuzzy association rules, it 
cannot be used to find any fuzzy functional dependencies. 

There is a whole spectrum of practical problems that can be regarded as fuzzy 
association rule mining problems. Because of the complexity of the fuzzy association 
rule mining problem, there is still a lot to be explored, such as the different types 
of fuzzy association rules, the interestingness issues in fuzzy association rule 
mining, and new data mining algorithms for discovering fuzzy association rules. 
Abstract. The explosive growth in data collection in business 
organizations introduces the problem of turning these rapidly expanding 
data stores into nuggets of actionable knowledge. The state-of-the-art 
data mining tools available for this integrate loosely with data stored in 
DBMSs, typically through a cursor interface. In this paper, we consider 
several formulations of association rule mining (a typical data mining 
problem) using SQL-92 queries and study the performance of different 
join orders and join methods for executing them. We analyze the cost of 
the different execution plans, which provides a basis to incorporate the 
semantics of association rule mining into future query optimizers. Based 
on them we identify certain optimizations and develop the Set-oriented 
Apriori approach. This work is an initial step towards developing 
“SQL-aware” mining algorithms and exploring the enhancements to current 
relational DBMSs to make them “mining-aware”, thereby bridging the 
gap between the two. 

1 Introduction 

A large number of business organizations are installing data warehouses based 
on relational database technology and it is extremely important to be able to 
mine nuggets of useful and understandable information from these data ware- 
houses. The initial efforts in data mining research were to bring together techniques 
from machine learning and statistics to define new mining operations and develop 
algorithms for them. A majority of the mining algorithms were built for data 
stored in file systems and coupling with DBMSs was provided through ODBC or 
SQL cursor interface. However, integrating mining with relational databases is 
becoming increasingly important with the growth in relational data warehousing 
technology. 

There have been several research efforts recently aimed at tighter integration 
of mining with database systems. On the one hand, there have been several 
language proposals to extend SQL with specialized mining operators. A few 
examples are DMQL [4], M-SQL [6] and the Mine rule operator [7]. However, 
these proposals do not address the processing techniques for these operators 
inside a database engine. On the other hand, researchers have addressed the issue 
of exploiting the capabilities of conventional relational systems and their 
object-relational extensions to execute mining operations. This entails transforming 
the mining operations into database queries and in some cases developing newer 
techniques that are more appropriate in the database context. The UDF-based 
(user defined function) approach in [2], the SETM algorithm [5], the formulation 
of association rule mining as “query flocks” [10] and SQL queries for mining [9] 
all belong to this category. 

Two categories of SQL implementations for association rule mining - one 
based purely on SQL-92 and the other using the object-relational extensions to 
SQL (SQL-OR) - are presented in [9]. The experimental results show that SQL- 
OR outperforms SQL-92 for most of the datasets. However, the object-relational 
extensions like table functions and user-defined functions (UDFs) used in the 
SQL-OR approaches are not yet standardized across the major DBMS vendors 
and hence portability could suffer. Moreover, optimization and parallelization of 
the object-relational extensions could be harder. 

In this paper, we analyze the performance of the various SQL-92 approaches 
and study the implications of different join orders and join methods. The 
motivation for this study is to understand how well we can do with SQL-92. We derive 
cost formulae for the different approaches in terms of the relational operators 
and the input data parameters. These cost expressions can be used in any cost 
based optimizer. Based on the performance experiments and the cost formulae, 
we identify certain optimizations and develop the Set-oriented Apriori approach 
that performs better than the best SQL-92 approach in [9]. We also study the 
scale-up properties of Set-oriented Apriori. 

The rest of the paper is organized as follows: We review association rule 
mining and a few SQL formulations of it in Section 2. In Section 3, we present a 
cost based analysis of the SQL approaches. Section 4 presents the performance 
optimizations and their impact on the execution cost and discusses the 
Set-oriented Apriori approach. We report the results of some of our performance 
experiments in Section 5 and conclude in Section 6. 

2 Association Rules 

Association rules capture recurring patterns in large databases of 
transactions, each of which is a set of items. The intuitive meaning of a typical 
association rule $X \rightarrow Y$, where X and Y are sets of items, is that the items in X 
and Y tend to co-occur in the transactions. An example of such a rule might 
be that “60% of transactions that contain beer also contain diapers; 5% of all 
transactions contain both these items”. Here 60% is called the confidence of the 
rule and 5% the support of the rule. The problem of mining association rules is 
to find all rules that satisfy a user-specified minimum support and confidence 
threshold. This problem can be decomposed into two subproblems of finding the 
frequent itemsets (item combinations with minimum support) and generating 
the rules from them [1]. 
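
To make the two measures concrete, the following minimal sketch computes support and confidence for a rule over a handful of transactions. The function name and the toy baskets are illustrative assumptions and do not reproduce the 60% and 5% figures of the example above.

```python
def support_and_confidence(transactions, x, y):
    """Support and confidence of the rule x -> y over a list of item sets."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)          # transactions containing X
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # transactions containing X and Y
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

baskets = [
    {"beer", "diapers"},
    {"beer"},
    {"milk"},
    {"beer", "diapers", "milk"},
    {"diapers"},
]
support, confidence = support_and_confidence(baskets, {"beer"}, {"diapers"})
```

Here two of the five baskets contain both items, and two of the three baskets with beer also contain diapers.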

2.1 Apriori Algorithm 

The basic Apriori algorithm [3] for discovering frequent itemsets makes 
multiple passes over the data. In the k-th pass it finds all itemsets having k items, 
called the k-itemsets. Each pass consists of two phases: the candidate generation 
phase and the support counting phase. In the candidate generation phase, 
the set of frequent (k-1)-itemsets, $F_{k-1}$, is used to generate the set of 
potentially frequent candidate k-itemsets, $C_k$. The support counting phase counts the 
support of all the itemsets in $C_k$ by examining the transactions and retains the 
itemsets having the minimum support. The algorithm terminates when $C_{k+1}$ 
becomes empty. 
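
The pass structure can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the implementation of [3]: candidates are formed by taking unions of frequent (k-1)-itemsets and keeping those whose (k-1)-subsets are all frequent, and the minimum support is assumed to be an inclusive absolute count.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation with subset pruning.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                        frozenset(sub) in prev for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Support counting: scan the transactions once per pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(baskets, minsup=3)
```

The loop stops as soon as a pass yields no frequent itemsets, which also covers the case where no candidates can be generated.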

2.2 Apriori candidate generation using SQL 

We briefly outline the SQL-based candidate generation process in [9] here. 
$C_k$ is obtained by joining two copies of $F_{k-1}$ as: 

insert into C_k select I_1.item_1, ..., I_1.item_{k-1}, I_2.item_{k-1} 
from F_{k-1} I_1, F_{k-1} I_2 
where I_1.item_1 = I_2.item_1 and ... and I_1.item_{k-2} = I_2.item_{k-2} and 
I_1.item_{k-1} < I_2.item_{k-1} 

The join result, which is a set of k-itemsets, is further pruned using the subset 
pruning strategy that all subsets of a frequent itemset should be frequent. The 
subset pruning can be accomplished in SQL by additional joins with (k - 2) more 
copies of $F_{k-1}$. 
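
For k = 3 the join and the subset pruning can be tried out with SQLite from the Python standard library. This is a sketch under assumptions: the table and column names (F2, C3, item1, ...) and the sample itemsets are hypothetical, and a single extra copy of F2 (k - 2 = 1) checks the one 2-subset that the join itself does not guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table F2 (item1 text, item2 text)")
conn.executemany("insert into F2 values (?, ?)",
                 [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])
conn.execute("create table C3 (item1 text, item2 text, item3 text)")

# I1 and I2 share the first item and are ordered on the last one;
# I3 prunes candidates whose (item2, item3) subset is not frequent.
conn.execute("""
    insert into C3
    select I1.item1, I1.item2, I2.item2
    from F2 I1, F2 I2, F2 I3
    where I1.item1 = I2.item1
      and I1.item2 < I2.item2
      and I3.item1 = I1.item2
      and I3.item2 = I2.item2
""")
c3 = conn.execute(
    "select item1, item2, item3 from C3 order by item1, item2, item3").fetchall()
```

Without the third copy the join alone would also produce (a, b, d) and (a, c, d), whose subsets (b, d) and (c, d) are not frequent.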



2.3 Support counting by K-Way joins 



In this approach, the support counting is formulated as a join query. The 
transaction data is stored in a relational table T with the schema (tid, item). 
For a given tid, there are as many rows in T as the number of items in that 
transaction. For real-life datasets, the maximum and minimum number of items 
per transaction differ a lot and the maximum number of items could even be 
more than the number of columns allowed for a table. Hence, this schema is 
more convenient than alternate representations. In the k-th pass, k copies of the 
transaction table T are joined with the candidate table $C_k$, followed by 
a group by on the itemsets as shown in Figure 1. Note that the plan tree 
generated by the query processor could look quite different from the tree diagram 
shown below. 



insert into F_k select item_1, ..., item_k, count(*) 
from C_k, T t_1, ..., T t_k 
where t_1.item = C_k.item_1 and ... and 
t_k.item = C_k.item_k and 
t_1.tid = t_2.tid and ... and 
t_{k-1}.tid = t_k.tid 
group by item_1, item_2, ..., item_k 
having count(*) > :minsup 

Fig. 1. Support Counting by K-way join 

Subquery optimization. The basic KwayJoin approach can be optimized to 
make use of common prefixes between the itemsets in Ck by splitting the support 
counting phase into a sequence of k nested subqueries [9]. 
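
The basic K-way join can be exercised for k = 2 with SQLite from the Python standard library. The sketch below is illustrative only: the table contents are made up, the column names follow the (tid, item) schema described above, and the minimum support is assumed to be an inclusive absolute count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid integer, item text)")
conn.execute("create table C2 (item1 text, item2 text)")

# One row per (transaction, item), as in the (tid, item) schema.
baskets = {1: "abc", 2: "ab", 3: "ac", 4: "bc", 5: "ab"}
conn.executemany("insert into T values (?, ?)",
                 [(tid, item) for tid, items in baskets.items() for item in items])
conn.executemany("insert into C2 values (?, ?)",
                 [("a", "b"), ("a", "c"), ("b", "c")])

minsup = 3
f2 = conn.execute("""
    select C2.item1, C2.item2, count(*)
    from C2, T t1, T t2
    where t1.item = C2.item1
      and t2.item = C2.item2
      and t1.tid = t2.tid
    group by C2.item1, C2.item2
    having count(*) >= ?
    order by C2.item1, C2.item2
""", (minsup,)).fetchall()
```

Each candidate is matched against pairs of rows of T that share a tid, so the group size is exactly the number of transactions containing both items.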

3 Analysis of execution plans 

We experimented with a number of alternative execution plans for this query. 
We could force the query processor to choose different plans by creating different 
indices on T and $C_k$, and in some cases by disabling certain join methods. We 
analyze two different execution plans below. 
Fig. 2. K-way join plan with Ck as outer relation 

Fig. 3. K-way join plan with Ck as inner relation 



In the cost analysis, we use the mining-specific data parameters and 
knowledge about association rule mining (Apriori algorithm [3] in this case) to 
estimate the cost of joins and the size of join results. Even though current relational 
optimizers do not use this mining-specific semantic information, the analysis 
provides a basis for developing “mining-aware” optimizers. The cost formulae 
are presented in terms of operator costs in order to make them general; for 
instance, join(p, q, r) denotes the cost of joining two relations of size p and q to get 
a result of size r. The data parameters and operators used in the analysis are 
summarized in Table 1. 



R              number of records in the input transaction table 
T              number of transactions 
N              average number of items per transaction = R / T 
F_1            number of frequent items 
S(C)           sum of support of each itemset in set C 
s_k            average support of a frequent k-itemset = S(F_k) / F_k 
R_f            number of records out of R involving frequent items = S(F_1) 
N_f            average number of frequent items per transaction = R_f / T 
C_k            number of candidate k-itemsets 
C(n, k)        number of k-combinations possible out of a set of size n = n! / (k! (n - k)!) 
group(n, m)    cost of grouping n records out of which m are distinct 
join(p, q, r)  cost of joining two relations of size p and q to get a result of size r 

Table 1. Notations used in cost analysis 

3.1 KwayJoin plan with Ck as outer relation 

Start with $C_k$ as the outermost relation and perform a series of joins with the 
k copies of T. The final join result is grouped on the k items to find the support 
counts (see Figure 2). The choice of join methods for each of the intermediate 
joins depends on the availability of indices, the size of intermediate results, 
the amount of available memory, etc. For instance, the efficient execution of nested 
loops joins requires an index (item, tid) on T. If the intermediate join result is 
large, it could be advantageous to materialize it and perform a sort-merge join. 

For each candidate itemset in $C_k$, the join with T produces as many records 
as the support of its first item. Similarly, the relation obtained after joining $C_k$ 
with l copies of T contains as many records as the sum of the support counts of the 
l-item prefixes of $C_k$. Hence the cost of the l-th join is $join(C_k * s_{l-1}, R, C_k * s_l)$, 
where $s_0 = 1$. Note that the values of the $s_l$'s can be computed from statistics 
collected in the previous passes. The last join (with $T_k$) produces $S(C_k)$ records; 
there will be as many records for each candidate as its support. $S(C_k)$ can 
be estimated by adding the support estimates of all the itemsets in $C_k$. A good 
estimate for the support of a candidate itemset is the minimum of the support 
counts of all its subsets. The overall cost of this plan expressed in terms of 
operator costs is: 

$\sum_{l=1}^{k-1} join(C_k * s_{l-1}, R, C_k * s_l) + join(C_k * s_{k-1}, R, S(C_k)) + group(S(C_k), C_k)$ 

3.2 KwayJoin plan with Ck as inner relation 

In this plan, we join the k copies of T and the resulting k-item combinations 
are joined with $C_k$ to filter out non-candidate item combinations. The final join 
result is grouped on the k items (see Figure 3). The result of joining l copies 
of T is the set of all possible l-item combinations of transactions. We know 
that the items in the candidate itemset are lexicographically ordered and hence 
we can add extra join predicates as shown in Figure 3 to limit the join result 
to l-item combinations (without these extra predicates the join will result in 
l-item permutations). When $C_k$ is the outermost relation these predicates are 
not required. A mining-aware optimizer should be able to rewrite the query 
appropriately. The last join produces $S(C_k)$ records. The overall cost is: 

$\sum_{l=1}^{k-1} join(C(N, l) * T, R, C(N, l+1) * T) + join(C(N, k) * T, C_k, S(C_k)) + group(S(C_k), C_k)$, 
where $C(N, 1) * T = R$ 
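
The two cost expressions of Sections 3.1 and 3.2 can be evaluated side by side once operator cost functions are supplied. The sketch below is only an illustration under assumptions: `toy_join` and `toy_group` are placeholder cost models, the parameter values are invented, and N is taken to be an integer so that C(N, l) can be computed exactly.

```python
from math import comb

def cost_ck_outer(k, c_k, s, r, s_ck, join, group):
    """Plan with C_k as the outer relation (Section 3.1).

    s[l] is the average support of a frequent l-itemset, with s[0] == 1.
    """
    total = sum(join(c_k * s[l - 1], r, c_k * s[l]) for l in range(1, k))
    total += join(c_k * s[k - 1], r, s_ck)   # last join yields S(C_k) records
    return total + group(s_ck, c_k)

def cost_ck_inner(k, c_k, n, t, s_ck, join, group):
    """Plan with C_k as the inner relation (Section 3.2).

    n is the number of items per transaction and t the number of
    transactions, so that C(n, 1) * t equals R.
    """
    r = comb(n, 1) * t
    total = sum(join(comb(n, l) * t, r, comb(n, l + 1) * t) for l in range(1, k))
    total += join(comb(n, k) * t, c_k, s_ck)
    return total + group(s_ck, c_k)

def toy_join(p, q, r):
    """Placeholder join cost: read both inputs and write the result."""
    return p + q + r

def toy_group(n, m):
    """Placeholder grouping cost: one unit per input record."""
    return n

outer = cost_ck_outer(k=2, c_k=100, s=[1, 50, 10], r=1000, s_ck=800,
                      join=toy_join, group=toy_group)
inner = cost_ck_inner(k=2, c_k=100, n=5, t=200, s_ck=800,
                      join=toy_join, group=toy_group)
```

Which plan is cheaper depends entirely on the data parameters and on the operator cost models plugged in, which is why the formulae are stated in terms of operator costs.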

3.3 Effect of subquery optimization 

The subquery optimization (see [9] for the details) makes use of common 
prefixes among candidate itemsets. Unfolding all the subqueries will result in a 
query tree which structurally resembles the KwayJoin plan tree shown in 
Figure 2. Subquery $Q_l$ produces $d_l^k * s_l$ records, where $d_j^k$ denotes the number of 
distinct j-item prefixes of $C_k$. In contrast, the l-th join in the KwayJoin plan 
results in $C_k * s_l$ records. The total cost of this approach can be estimated as 
below, where trijoin(p, q, r, s) denotes the cost of joining three relations of size 
p, q, r respectively, producing a result of size s. 

$\sum_{l=1}^{k} trijoin(R, s_{l-1} * d_{l-1}^k, d_l^k, s_l * d_l^k) + group(S(C_k), C_k)$ 

We observed tremendous performance improvements because of this 
optimization. The number of distinct l-item prefixes is much less compared to the 
total number of candidate itemsets. This results in correspondingly smaller 
intermediate tables as shown in the analysis above, which is the key to the 
performance gain. 

Experimental datasets. We used synthetic data generated according to the 
procedure explained in [3] for our experiments. The results reported in this paper 
are for the datasets T5.I2.D100K and T10.I4.D100K (for scale-up experiments 
we used other datasets also). For example, the first dataset consists of 100 
thousand transactions, each containing an average of 5 items. The average size of 
the maximal potentially frequent itemsets (denoted as I) is 2. The transaction 
table corresponding to this dataset had approximately 550 thousand records. 
Both datasets had a total of 1000 items. The second dataset has 100 thousand 
transactions, each containing an average of 10 items (total of about 1.1 million 
records) and the average size of maximal potentially frequent itemsets is 4. 

All the experiments were performed on PostgreSQL Version 6.3 [8], a public 
domain DBMS, installed on an 8-processor Sun Ultra Enterprise 4000/5000 with 
248 MHz CPUs and 256 MB main memory per processor, running Solaris 2.6. 
Note that PostgreSQL is not parallelized. It supports nested loops, hash-based 
and sort-merge join methods and provides finer control of the optimizer to disable 
any of the join methods. We have found it to be a useful platform for studying 
the performance of different join methods and execution plans. 

4 Performance optimizations 

The cost analysis presented above provides some insight into the different 
components of the execution time in the different passes and what can be opti- 
mized to achieve better performance. In this section, we present three optimiza- 
tions to the KwayJoin approach (other than the subquery optimization) and 
discuss how they impact the cost. 

4.1 Pruning non-frequent items 

The size of the transaction table is a major factor in the cost of joins involving
T. It can be reduced by pruning the non-frequent items from the transactions
after the first pass. We store the transaction data as (tid, item) tuples in a
relational table, and hence this pruning can be achieved simply by dropping the
tuples corresponding to non-frequent items by joining T and F_1. The pruned
transactions are stored in table T_f, which has the same schema as T.
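The pruning step can be sketched with an in-memory SQLite database driven from Python. This is only an illustration under stated assumptions: the table and column names (`T`, `F1`, `Tf`, `cnt`) and the toy data are hypothetical, SQLite stands in for the PostgreSQL system used in the experiments, and the first pass uses the same `count(*) > minsup` filter as the later queries.

```python
import sqlite3


def prune_non_frequent(conn: sqlite3.Connection) -> int:
    """Build Tf from T by keeping only tuples whose item appears in F1."""
    conn.execute("DROP TABLE IF EXISTS Tf")
    conn.execute("CREATE TABLE Tf (tid INTEGER, item INTEGER)")
    # Joining T with F1 drops every (tid, item) tuple of a non-frequent item.
    conn.execute(
        "INSERT INTO Tf SELECT t.tid, t.item FROM T t, F1 f WHERE t.item = f.item"
    )
    return conn.execute("SELECT COUNT(*) FROM Tf").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
conn.execute("CREATE TABLE F1 (item INTEGER, cnt INTEGER)")  # assumed schema
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 10), (3, 40)],
)
# First pass: frequent 1-itemsets (hypothetical minimum support count of 1).
conn.execute(
    "INSERT INTO F1 SELECT item, COUNT(*) FROM T GROUP BY item "
    "HAVING COUNT(*) > :minsup",
    {"minsup": 1},
)
original_size = conn.execute("SELECT COUNT(*) FROM T").fetchone()[0]
pruned_size = prune_non_frequent(conn)
```

In this toy data, items 30 and 40 occur only once, so their tuples are dropped and `Tf` ends up smaller than `T`.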

For some of the synthetic datasets we used in our experiments, this pruning
reduced the size of the transaction table to about half its original size. This
could be even more useful for real-life datasets, which typically contain lots
of non-frequent items. For example, some of the real-life datasets used for the
experiments reported in [9] contained of the order of 100 thousand items, out
of which only a few hundred were frequent. Figure 4 shows the reduction in
transaction table size due to this optimization for our experimental datasets.
The initial size (R) and the size after pruning (R_f) for different support values
are shown.




Fig. 4. Reduction in transaction table size by non-frequent item pruning

Fig. 5. Benefit of second pass optimization

4.2 Eliminating candidate generation in second pass 

In the second pass, C_2 is almost a cartesian product of the two F_1s used
to generate it, and hence materializing it and joining it with the T's (or T_f's)
could be expensive. The generation of C_2 can be completely eliminated by
formulating the join query to find F_2 as below. The cost of the second pass with
this optimization is join(R_f, R_f, C(N_f, 2)) + group(C(N_f, 2), C(|F_1|, 2)). Even
though the grouping cost remains the same, there is a big reduction in the join
costs compared to the basic KwayJoin approach.

insert into F2 select p.item, q.item, count(*) from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item having count(*) > :minsup
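A runnable sketch of this second-pass query, again using SQLite from Python purely for illustration. The column names of `F2` (`item1`, `item2`, `cnt`) and the sample transactions are assumptions and do not come from the experiments.

```python
import sqlite3

# Self-join of the pruned transaction table; no candidate table C2 is needed.
SECOND_PASS_SQL = """
INSERT INTO F2
SELECT p.item, q.item, COUNT(*)
FROM Tf p, Tf q
WHERE p.tid = q.tid AND p.item < q.item
GROUP BY p.item, q.item
HAVING COUNT(*) > :minsup
"""


def second_pass(conn: sqlite3.Connection, minsup: int):
    """Compute the frequent 2-itemsets directly from Tf."""
    conn.execute("CREATE TABLE F2 (item1 INTEGER, item2 INTEGER, cnt INTEGER)")
    conn.execute(SECOND_PASS_SQL, {"minsup": minsup})
    return conn.execute(
        "SELECT item1, item2, cnt FROM F2 ORDER BY item1, item2"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tf (tid INTEGER, item INTEGER)")
conn.executemany(
    "INSERT INTO Tf VALUES (?, ?)",
    [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 10), (3, 30)],
)
f2 = second_pass(conn, minsup=1)
```

The condition `p.item < q.item` makes each pair appear once per transaction, in lexicographic order, so the grouping counts the support of every 2-item combination without materializing C_2.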



Figure 5 compares the running time of the second pass with this optimization
to the basic KwayJoin approach for the T5.I2.D100K dataset (similar trends were
also observed for other datasets). For the KwayJoin approach, the best execution
plan was the one which generates all 2-item combinations, joins them with the
candidate set and groups the join result.

4.3 Set-oriented Apriori 



The SQL formulations of association rule mining are based on generating item
combinations in various ways, and similar work is performed in all the different
passes. Therefore, storing the item combinations and reusing them in subsequent
passes will improve the performance, especially in the higher passes. In the k-th
pass of the support counting phase, we generate a table T_k which contains all
k-item combinations that are candidates. T_k has the schema (tid, item_1, ..., item_k).
We join T_{k-1}, T_f and C_k as shown below to generate T_k. The frequent itemsets
F_k are obtained by grouping the tuples of T_k on the k items and applying the
minimum support filtering.



insert into Tk
select p.tid, p.item_1, ..., p.item_{k-1}, q.item
from Ck, Tk-1 p, Tf q
where p.item_1 = Ck.item_1 and ... and
p.item_{k-1} = Ck.item_{k-1} and
q.item = Ck.item_k and
p.tid = q.tid



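[Figure 6: join tree. T_{k-1} p and T_f q are joined on p.tid = q.tid and
p.item_{k-1} < q.item; the result is joined with C_k on p.item_1 = C_k.item_1, ...,
p.item_{k-1} = C_k.item_{k-1} and q.item = C_k.item_k.]

Fig. 6. Generation of T_k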
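The k-th pass query has the same shape for every k, so its text can be generated. The sketch below is an illustration only: the helper names `tk_insert_sql` and `fk_select_sql`, the physical table names `T3`, `C4`, `T4` and the columns `item1..itemk` are assumptions, and SQLite stands in for the DBMS.

```python
import sqlite3


def tk_insert_sql(k: int) -> str:
    """INSERT statement that fills T_k by joining C_k, T_{k-1} and Tf."""
    if k < 3:
        raise ValueError("this sketch only covers passes k >= 3")
    prefix_cols = ", ".join(f"p.item{i}" for i in range(1, k))
    conds = [f"p.item{i} = c.item{i}" for i in range(1, k)]
    conds += [f"q.item = c.item{k}", "p.tid = q.tid"]
    return (
        f"INSERT INTO T{k} SELECT p.tid, {prefix_cols}, q.item "
        f"FROM C{k} c, T{k - 1} p, Tf q WHERE " + " AND ".join(conds)
    )


def fk_select_sql(k: int) -> str:
    """Support counting: group T_k on its k items and filter by minsup."""
    items = ", ".join(f"item{i}" for i in range(1, k + 1))
    return (
        f"SELECT {items}, COUNT(*) FROM T{k} "
        f"GROUP BY {items} HAVING COUNT(*) > :minsup"
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Tf (tid INTEGER, item INTEGER);
    CREATE TABLE T3 (tid INTEGER, item1 INTEGER, item2 INTEGER, item3 INTEGER);
    CREATE TABLE C4 (item1 INTEGER, item2 INTEGER, item3 INTEGER, item4 INTEGER);
    CREATE TABLE T4 (tid INTEGER, item1 INTEGER, item2 INTEGER, item3 INTEGER,
                     item4 INTEGER);
    """
)
# Transactions 1 and 2 contain items 1..4, transaction 3 only items 1..3.
tf_rows = [(t, i) for t in (1, 2) for i in (1, 2, 3, 4)] + [(3, i) for i in (1, 2, 3)]
conn.executemany("INSERT INTO Tf VALUES (?, ?)", tf_rows)
conn.executemany("INSERT INTO T3 VALUES (?, 1, 2, 3)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO C4 VALUES (1, 2, 3, 4)")

conn.execute(tk_insert_sql(4))
t4 = conn.execute("SELECT * FROM T4 ORDER BY tid").fetchall()
f4 = conn.execute(fk_select_sql(4), {"minsup": 1}).fetchall()
```

Only transactions 1 and 2 extend the stored combination (1, 2, 3) with item 4, so `T4` holds two tuples and the single candidate is counted with support 2.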

We can further prune T_k by filtering out item combinations that turned out to
be non-frequent. However, this is not essential since we join it with the candidate
set C_{k+1} in the next pass to generate T_{k+1}. The only advantage of pruning T_k is
that we will have a smaller table to join in the next pass, but at the expense of
joining T_k with F_k. We use the optimization discussed above for the second pass
and hence do not materialize and store T_2. Therefore, we generate T_3 directly
by joining T_f with C_3 as:

insert into T3 select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
and p.tid = q.tid and q.tid = r.tid

We can also use the Subquery approach to generate T_3 if that is less expensive.
T_3 will contain exactly the same tuples produced by subquery Q_3.

The Set-oriented Apriori algorithm bears some resemblance to the three-way
join approach in [9], the SETM algorithm in [5] and the AprioriTid algorithm
in [3]. In the three-way join approach, the temporary table T_k stores, for each
transaction, the identifiers of the candidates it supported. T_k is generated by
joining two copies of T_{k-1} with C_k. The generation of F_k requires a further
join of T_k with C_k. The candidate generation in SETM is different and hence
the support counting queries are also different. The intermediate tables T_k
generated by Set-oriented Apriori are much smaller than the corresponding ones
of SETM. AprioriTid makes use of special data structures which are difficult to
maintain in the SQL formulation.

Cost comparison. The k-th pass of Set-oriented Apriori requires only a single
3-way join* due to the materialization and reuse of item combinations, and the
cost is trijoin(R_f, T_{k-1}, C_k, S(C_k)) + group(S(C_k), C_k). The table T_{k-1} contains
exactly the same tuples as subquery Q_{k-1} and hence has a size of
s_{k-1} * d^k_{k-1}. Also, d^k_k is the same as C_k. Therefore, the k-th pass cost of
Set-oriented Apriori is the same as the k-th term in the join cost summation of
the subquery approach.

Figure 7 compares the running times of the subquery and Set-oriented Apriori 
approaches for the dataset T10.I4.D100K for 0.33% support. We show only the 
times for passes 3 and higher since both the approaches are the same in the first 
two passes. 




Fig. 7. Benefit of reusing item combinations

Fig. 8. Space requirements of the set-oriented apriori approach



Space overhead. Storing the item combinations requires additional space. The
size of the table T_k is the same as S(C_k), which is the total support of all the
k-item candidates. Assuming that the tid and item attributes are integers, each
tuple in T_k consists of k + 1 integer attributes. Therefore, the space requirement
of T_k is |T_k| * (k + 1). Figure 8 shows the size of T_k in terms of number of integers,
for the dataset T10.I4.D100K for two different support values. The space needed
for the input data table T is also shown for comparison. T_2 is not shown in
the graph since we do not materialize and store it in the Set-oriented Apriori
approach. Note that once T_k is materialized, T_{k-1} can be deleted unless it needs
to be retained for some other purpose.

5 Performance experiments 

We compared the performance of Set-oriented Apriori with the Subquery ap- 
proach (the best SQL-92 approach in [9]) for a wide range of data parameters 
and support values. We report the results on two of the datasets - T5.I2.D100K 
and T10.I4.D100K - described in Section 3.3. 

* Note that this may be executed as two 2-way joins since 3-way joins are not
generally supported in current relational systems.




Evaluation and Optimization of Join Queries for Association Rule Mining 






Figure 9 shows the total time taken for each of the different passes of the
Subquery and Set-oriented Apriori approaches. We also ran the SETM algorithm [5]
for a few support values and found it to be an order of magnitude slower.
Set-oriented Apriori performs better than Subquery for all the support values. The
first two passes of both approaches are similar and take approximately equal
amounts of time. The difference widens for higher numbered passes, as
explained in Section 4.3. For T5.I2.D100K, F_2 was empty for support values
higher than 0.3%, and therefore we chose lower support values to study the
relative performance in higher numbered passes.




Fig. 9. Comparison of Subquery and Set-oriented Apriori approaches 

In some cases, the optimizer did not choose the best plan. For example, for
joins with T (T_f for Set-oriented Apriori), the optimizer chose a nested loops plan
using the (item, tid) index on T in many cases where the corresponding sort-merge
plan was faster, by an order of magnitude in some cases. We were able
to experiment with different plans by disabling certain join methods (disabling
nested loops join for the above case). We also broke down the multi-way joins
into simpler two-way joins to study the performance implications. The reported
times correspond to the best join order and join methods.

In all the experiments, we measured the CPU and I/O times separately. An
interesting observation is that the I/O time is less than one third of the CPU
time. This shows that there is a need to revisit the traditional optimization and
parallelization strategies designed to optimize for I/O time, in order to handle
the newer decision support and mining queries efficiently.

5.1 Scale-up Experiment 

Figure 10 shows how Set-oriented Apriori scales up as the number of trans- 
actions is increased from 10,000 to 1 million. We used the datasets T5.I2 and 
T10.I4 for the average sizes of transactions and itemsets respectively. The mini- 
mum support level was kept at 1%. The first graph shows the absolute execution 
times and the second one shows the times normalized with respect to the times 
for the 10,000 transaction datasets. It can be seen that the scale-up is linear. 

The scale-up with increasing transaction size is shown in Figure 11. In these 
experiments we kept the physical size of the database roughly constant by keep- 
ing the product of the average transaction size and the number of transactions 
constant. We fixed the minimum support level in terms of the number of trans- 
actions, since fixing it as a percentage would have led to large increases in the 
number of frequent itemsets as the transaction size increased. The numbers in 
the legend (e.g. 1000) refer to this minimum support. The execution times in- 
crease with the transaction size, but only gradually. The main reason for this 
increase was that the number of item combinations present in a transaction
increases with the transaction size.

Fig. 10. Number of transactions scale-up

Fig. 11. Transaction size scale-up



6 Conclusion 

We explored the problem of developing SQL-aware implementations of as- 
sociation rule mining. We analyzed the best available SQL-92 formulation - 
KwayJoin approach with Subquery optimization - primarily from a performance 
perspective and conducted detailed performance experiments to understand how 
well current relational DBMSs handle such queries. Based on the cost evalua- 
tion and the performance study we identify certain optimizations and develop 
a set-oriented version of the apriori algorithm. For the higher numbered passes, 
Set-oriented Apriori performs significantly better than the Subquery approach. 
The cost analysis presented in this paper points to useful enhancements to cur- 
rent optimizers to make them more “mining-aware” . We also studied the scale-up 
behavior of Set-oriented Apriori with respect to increase in the number of
transactions and average transaction size.

Abstract. Efficient index construction in multidimensional data spaces is important
for many knowledge discovery algorithms, because construction times typically
must be amortized by performance gains in query processing. In this paper, we
propose a generic bulk loading method which allows the application of user-defined
split strategies in the index construction. This approach allows the adaptation of the
index properties to the requirements of a specific knowledge discovery algorithm.

Because large data sets do not fit in main memory, our algorithm is based on
external sorting. Decisions of the split strategy can be made according to a
sample of the data set which is selected automatically. The sort algorithm is a
variant of the well-known Quicksort algorithm, enhanced to work on secondary
storage. The index construction has a runtime complexity of O(n log n).

We show both analytically and experimentally that the algorithm outperforms tradi- 
tional index construction methods by large factors. 

1. Introduction 

Efficient index construction in multidimensional data spaces is important for many
knowledge discovery tasks. Many algorithms for knowledge discovery [JD 88, KR 90, NH 94,
EKSX 96, BBBK 99], especially clustering algorithms, rely on efficient processing of sim- 
ilarity queries. In such a setting, multidimensional indexes are often created in a prepro- 
cessing step to knowledge discovery. If the index is not needed for general purpose query 
processing, it is not permanently maintained, but discarded after the KDD algorithm is 
completed. Therefore, the time spent in the index construction must be amortized by runt- 
ime improvements during knowledge discovery. Usually, indexes are constructed using 
repeated insert operations. This ‘dynamic index construction’ , however, causes a serious 
performance degeneration. We show later in this paper that in a typical setting, every insert 
operation leads to at least one access to a data page of the index. Therefore, there is an 
increasing interest in fast bulk-loading operations for multidimensional index structures 
which cause substantially fewer page accesses for the index construction. 

A second problem is that indexes must be carefully optimized in order to achieve a 
satisfactory performance (cf. [Boh 98, BK 99, BBJh- 99]). The optimization objectives 
[BBKK 97] depend on the properties of the data set (dimension, distribution, number of 
objects, etc.) and on the types of queries which are performed by the KDD algorithm 
(range queries [EKSX 96], nearest neighbor queries [KR 90, NH 94], similarity 
joins [BBBK 99], etc.). On the other hand, we may draw some advantage from the fact
that we do not only know a single data item at each point of time (as in the dynamic index
construction) but a large amount of data items. It is common knowledge that a higher
fanout and storage utilization of the index pages can be achieved by applying bulk-load
operations. A higher fanout yields a better search performance. Knowing all data a
priori allows us to choose an alternative data space partitioning. As we have shown in
[BBK 98a], a strategy of splitting the data space into two equally-sized portions causes,
under certain circumstances, a poor search performance in contrast to an unbalanced
split. Therefore, it is an important property of a bulk-loading algorithm that it allows
the splitting strategy to be exchanged according to the requirements specific to the
application.

The currently proposed bulk-loading methods either suffer from poor performance in 
the index construction or in the query evaluation, or are not suitable for indexes which 
do not fit into main memory. In contrast to previous bulk-loading methods, we present 
in this paper an algorithm for fast index construction on secondary storage which pro- 
vides efficient query processing and is generic in the sense that the split strategy can be 
easily exchanged. It is based on an extension of the Quicksort algorithm which facili- 
tates sorting on secondary storage (cf. section 3.3 and 3.4). The split strategy (section 
3.2) is a user-defined function. For the split decisions, a sample of the data set is exploit- 
ed which is automatically generated by the bulk-loading algorithm. 

2. Related Work 

Several methods for bulk-loading multidimensional index structures have been proposed. 
Space-filling curves provide a means to order the points such that spatial neighborhoods 
are maintained. In the Hilbert R-tree construction method [KF 94], the points are sorted 
according to their Hilbert value. The obtained sequence of points is decomposed into con- 
tiguous subsequences which are stored in the data pages. The page region, however, is not 
described by the interval of Hilbert values but by the minimum bounding rectangle of the 
points. The directory is built bottom up. The disadvantage of Hilbert R-trees is the high
overlap among page regions.

VAM-Split trees [JW 96], in contrast, use a concept of hierarchical space partitioning 
for bulk-loading R-trees or KDB-trees. Sort algorithms are used for this purpose. This 
approach does not exploit a priori knowledge of the data set and is not adaptable. 

Buffer trees [BSW 97] are a generalized technique to improve the construction perfor- 
mance for dynamic insert algorithms. The general idea is to collect insert operations to 
certain branches of the tree in buffers. These operations are propagated to the next deep- 
er level whenever such a buffer overflows. This technique preserves the properties of the 
underlying index structure. 



3. Our New Technique 

During the bulk-load operation, the complete data set is held on secondary storage. Al- 
though only a small cache in the main memory is required, cost intensive disk operations 
such as random seeks are minimized. In our algorithms, we strictly separate the split strat- 
egy from the core of the construction algorithm. Therefore, we can easily replace the split 
strategy and thus, create an arbitrary overlap-free partition for the given storage utilization. 
Various criteria for the choice of direction and position of split hyperplanes can be applied. 
The index construction is a recursive algorithm consisting of the following subtasks: 

• determining the tree topology (height, fanout of the directory nodes, etc.) 

• choice of the split strategy 

• external bisection of the data set according to tree topology and split strategy 

• construction of the index directory. 



3.1 Determination of the Tree Topology 

The first step of our algorithm is to determine the topology of the tree resulting from our 
bulk-load operation. The height of the tree can be determined as follows [Boh 98]: 



$$h = \left\lceil \log_{C_{\text{eff,dir}}}\left(\frac{n}{C_{\text{eff,data}}}\right) \right\rceil + 1$$

Figure 1: The Split Tree.



The fanout is given by the following formula: 



$$\text{fanout}(h, n) = \min\left(\left\lceil \frac{n}{C_{\text{eff,data}} \cdot C_{\text{eff,dir}}^{\,h-2}} \right\rceil,\; C_{\text{max,dir}}\right)$$
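As a small numeric illustration of the topology step, the sketch below computes a height and a fanout with integer arithmetic. It assumes that the height is the logarithm of n / C_eff,data to the base C_eff,dir, rounded up, plus one, and that the fanout is n divided by the effective capacity of a subtree of height h - 1, rounded up and capped by C_max,dir. The function names and the example capacities are hypothetical.

```python
def tree_height(n: int, c_eff_data: int, c_eff_dir: int) -> int:
    """Smallest h >= 1 such that c_eff_data * c_eff_dir**(h - 1) >= n."""
    if n < 1 or c_eff_data < 1 or c_eff_dir < 2:
        raise ValueError("need n >= 1, c_eff_data >= 1 and c_eff_dir >= 2")
    height, capacity = 1, c_eff_data
    while capacity < n:
        capacity *= c_eff_dir  # one more directory level
        height += 1
    return height


def fanout(h: int, n: int, c_eff_data: int, c_eff_dir: int, c_max_dir: int) -> int:
    """Number of children of a directory page of height h that stores n objects."""
    if h < 2:
        raise ValueError("only directory pages (h >= 2) have a fanout")
    subtree_capacity = c_eff_data * c_eff_dir ** (h - 2)
    needed = -(-n // subtree_capacity)  # ceiling division
    return min(needed, c_max_dir)


# Hypothetical capacities: 50 objects per data page, 20 entries per directory page.
height = tree_height(1_000_000, c_eff_data=50, c_eff_dir=20)
root_fanout = fanout(height, 1_000_000, c_eff_data=50, c_eff_dir=20, c_max_dir=30)
```

With these numbers, 20^3 = 8000 data pages are not enough for 20000 pages' worth of objects, so a fourth directory level is required, and the root needs three children.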

3.2 The Split Strategy 

In order to determine the split dimension, we have to consider two cases: If the data subset 
fits into main memory, the split dimension and the subset size can be obtained by comput- 
ing selectivities or variances from the complete data subset. Otherwise, decisions are based 
on a sample of the subset which fits into main memory and can be loaded without causing 
too many random seek operations. We use a simple heuristic to sample the data subset 
which loads subsequent blocks from three different places in the data set. 
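One possible split strategy can be sketched as follows. The text only states that the sample is read from three different places and that variances may be used, so the choice of the start, the middle and the end of the subset, the block size, and the rule "split the dimension with the largest variance" are all assumptions made for this illustration.

```python
from statistics import pvariance
from typing import List, Sequence

Point = Sequence[float]


def sample_three_places(data: List[Point], block_size: int) -> List[Point]:
    """Take one block each from the start, the middle and the end of the subset."""
    n = len(data)
    if n <= 3 * block_size:
        return list(data)  # small subsets are used completely
    starts = (0, (n - block_size) // 2, n - block_size)
    sample: List[Point] = []
    for start in starts:
        sample.extend(data[start:start + block_size])
    return sample


def split_dimension(sample: List[Point]) -> int:
    """Pick the dimension in which the sample has the largest variance."""
    dims = len(sample[0])
    variances = [pvariance([p[d] for p in sample]) for d in range(dims)]
    return max(range(dims), key=lambda d: variances[d])


# Toy data: dimension 0 takes only three values, dimension 1 spreads widely.
data = [(float(i % 3), float(i)) for i in range(100)]
sample = sample_three_places(data, block_size=5)
dim = split_dimension(sample)
```

Reading whole consecutive blocks keeps the number of random seeks at three, regardless of the sample size.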

3.3 Recursive Top-Down Partitioning 

Now, we are able to define a recursive algorithm for partitioning the data set. The algorithm 
consists of two procedures which are nested recursively (both procedures call each other). 
The first procedure, partition ( ), that is called once for each directory page has the following 
duties: 

• call the topology module to determine the fanout of the current directory page 

• call the split-strategy module to determine a split tree for the current directory page 

• call the second procedure, partition_acc_to_splittree( )

The second procedure partitions the data set according to the split dimensions and the
proportions given in the split tree. However, the proportions are not regarded as fixed
values. Instead, we will determine lower and upper bounds for the number of objects on
each side of the split hyperplane. This will help us to improve the performance of the next
step, the external bipartitioning. Let us assume that the ratio of the number of leaf nodes on
each side of the current node in the split tree is l : r, and that we are currently dealing with
N data objects. An exact split hyperplane would exploit the proportions:

$$N_{\text{left}} = N \cdot \frac{l}{l + r}, \qquad N_{\text{right}} = N - N_{\text{left}}$$

Instead of using the exact values, we compute an upper bound for N_left such that N_left is not
too large to be placed in l subtrees with height h - 1 and a lower bound for N_left such that
N_right is not too large for r subtrees:

$$N_{\text{max,left}} = l \cdot C_{\text{max,tree}}(h - 1), \qquad N_{\text{min,left}} = N - r \cdot C_{\text{max,tree}}(h - 1)$$
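The bounds can be made concrete with a few lines of Python. The function name and the example numbers are hypothetical; clamping the bounds to the range 0..N is an addition of this sketch, and the capacity of a subtree of height h - 1 is simply passed in as a number.

```python
from typing import Tuple


def left_size_bounds(n: int, l_leaves: int, r_leaves: int,
                     max_tree_capacity: int) -> Tuple[int, int, int]:
    """Return (n_min_left, n_exact_left, n_max_left) for one split tree node."""
    n_exact_left = n * l_leaves // (l_leaves + r_leaves)
    # The left side must fit into l subtrees ...
    n_max_left = min(n, l_leaves * max_tree_capacity)
    # ... and whatever remains must fit into r subtrees.
    n_min_left = max(0, n - r_leaves * max_tree_capacity)
    if n_min_left > n_max_left:
        raise ValueError("the subset does not fit into l + r subtrees")
    return n_min_left, n_exact_left, n_max_left


# 1000 objects, split ratio 3 : 2, each subtree holds at most 250 objects.
bounds = left_size_bounds(1000, l_leaves=3, r_leaves=2, max_tree_capacity=250)
```

Any split position between the lower and the upper bound yields subsets that still fit into their subtrees, which gives the bipartitioning step an interval rather than a single target position.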

An overview of the algorithm is depicted in C-like pseudocode in figure 2. For the
presentation of the algorithm, we assume that the data vectors are stored in an array on
secondary storage and that the current data subset is referred to by the parameters start
and n, where n is the number of data objects and start represents the address of the first
object.

index_construction (int n)
{
    int h = (int) (log (n / Ceffdata) / log (Ceffdir) + 1) ;
    partition (0, n, h) ;
}

partition (int start, int n, int height)
{
    if (height == 0) {
        ... // write data page, propagate info to parent
        return ;
    }
    int f = fanout (height, n) ;
    SplitTree st = split_strategy (start, n, f) ;
    partition_acc_to_splittree (start, n, height, st) ;
    ... // write directory page, propagate info to parent
}

partition_acc_to_splittree (int start, int n, int height, SplitTree st)
{
    if (is_leaf (st)) {
        // one child page of the current directory page
        partition (start, n, height - 1) ;
        return ;
    }
    int mtc = max_tree_capacity (height - 1) ;
    int n_maxleft = st->l_leaves * mtc ;
    int n_minleft = n - st->r_leaves * mtc ;
    int n_real = external_bipartition (start, n, st->splitdim,
                                       n_minleft, n_maxleft) ;
    partition_acc_to_splittree (start, n_real,
                                height, st->leftchild) ;
    partition_acc_to_splittree (start + n_real, n - n_real,
                                height, st->rightchild) ;
}

Figure 2: Recursive Top-Down Data Set Partitioning.

The procedure index_construction(n) determines the height of the tree and calls
partition( ) which is responsible for the generation of a complete data or directory page. The
function partition( ) first determines the fanout of the current page and calls
split_strategy( ) to construct an adequate split tree. Then partition_acc_to_splittree( ) is
called to partition the data set according to the split tree. After partitioning the data,
partition_acc_to_splittree( ) calls partition( ) in order to create the next deeper index
level. The height of the current subtree is decremented in this indirect recursive call.
Therefore, the data set is partitioned in a top-down manner, i.e. the data set is first
partitioned with respect to the highest directory level below the root node.
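The mutual recursion can be tried out in main memory with the following sketch. It is not the external algorithm: a full sort stands in for the external bipartitioning, the split tree is simply balanced with cycling split dimensions, and the page capacities `C_DATA` and `C_DIR` are hypothetical. Only the control flow between the two procedures follows the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]
C_DATA, C_DIR = 4, 3  # hypothetical capacities of data and directory pages


@dataclass
class SplitTree:
    leaves: int  # number of child pages below this split node
    splitdim: int = 0
    left: Optional["SplitTree"] = None
    right: Optional["SplitTree"] = None


def max_tree_capacity(height: int) -> int:
    return C_DATA * C_DIR ** height  # height 0 is a single data page


def split_strategy(f: int, dims: int, depth: int = 0) -> SplitTree:
    """Assumed strategy: balanced split tree with f leaves, cycling dimensions."""
    if f == 1:
        return SplitTree(1)
    l = f // 2
    return SplitTree(f, depth % dims,
                     split_strategy(l, dims, depth + 1),
                     split_strategy(f - l, dims, depth + 1))


def partition(points: List[Point], height: int, pages: List[List[Point]]) -> None:
    if height == 0:
        pages.append(points)  # "write data page"
        return
    mtc = max_tree_capacity(height - 1)
    f = min(-(-len(points) // mtc), C_DIR)  # fanout of this directory page
    st = split_strategy(f, len(points[0]))
    partition_acc_to_splittree(points, height, st, pages)


def partition_acc_to_splittree(points: List[Point], height: int, st: SplitTree,
                               pages: List[List[Point]]) -> None:
    if st.left is None or st.right is None:  # leaf of the split tree
        partition(points, height - 1, pages)
        return
    n, mtc = len(points), max_tree_capacity(height - 1)
    n_max_left = st.left.leaves * mtc
    n_min_left = n - st.right.leaves * mtc
    n_left = n * st.left.leaves // st.leaves  # exact proportion l : r
    n_left = max(n_min_left, min(n_left, n_max_left))
    # Stand-in for external_bipartition(): sort by the split dimension and cut.
    ordered = sorted(points, key=lambda p: p[st.splitdim])
    partition_acc_to_splittree(ordered[:n_left], height, st.left, pages)
    partition_acc_to_splittree(ordered[n_left:], height, st.right, pages)


def index_construction(points: List[Point]) -> List[List[Point]]:
    height = 0
    while max_tree_capacity(height) < len(points):
        height += 1
    pages: List[List[Point]] = []
    partition(points, height, pages)
    return pages


points = [(float((i * 37) % 101), float((i * 53) % 89)) for i in range(50)]
pages = index_construction(points)
```

In this sketch, `height` counts directory levels, so `partition( )` with height 0 emits a data page, which matches the pseudocode of figure 2.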

3.4 External Bipartitioning of the Data Set 

Our bipartitioning algorithm is comparable to the well-known Quicksort algorithm
[Hoa 62, Sed 78]. Bipartitioning means splitting the data set or a subset into two portions
according to the value of one specific dimension, the split dimension. After the
bipartitioning step, the “lower” part of the data set contains values in the split dimension
which are lower than a threshold value, the split value. The values in the “higher” part will
be higher than the split value. The split value is initially unknown and is determined
during the run of the bipartitioning algorithm.

Bipartitioning is closely related to sorting the data set according to the split dimension.
In fact, if the data is sorted, bipartitioning of any proportion can easily be achieved by
cutting the sorted data set into two subsets. However, sorting has a complexity of
O(n log n), and a complete sort order is not required for our purpose. Instead, we will
present a bipartitioning algorithm with an average-case complexity of O(n). The basic
idea of our algorithm is to adapt Quicksort as follows: Quicksort makes a bisection of
the data according to a heuristically chosen pivot value and then recursively calls
Quicksort for both subsets. Our first modification is to make only one recursive call for the
subset which contains the split interval. We are able to do that because the objects in the
other subset are on the correct side of the split interval anyway and need no further
sorting. The second modification is to stop the recursion if the position of the pivot value
is inside the split interval. The third modification is to choose the pivot values according
to the proportion rather than aiming for the middle.

Our bipartitioning algorithm works on secondary storage. It is well-known that the
Mergesort algorithm is better suited for external sorting than Quicksort. However,
Mergesort does not facilitate our modifications leading to an O(n) complexity and was
not further investigated for this reason. In our implementation, we use a sophisticated
scheme reducing disk I/O and especially reducing random seek operations much more
than a normal caching algorithm would be able to.

The algorithm can run in two modes, internal or external, depending on the question 
whether the processed data set fits into main memory or not. The internal mode is quite 
similar to Quicksort: The middle of three split attribute values in the database is taken as 
pivot value. The first object on the left side having a split attribute value larger than the 
pivot value is exchanged with the last element on the right side smaller than the pivot 
value until left and right object pointers meet at the bisection point. The algorithm stops 
if the bisection point is inside the goal interval. Otherwise, the algorithm continues 
recursively with the data subset containing the goal interval. 
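The internal mode can be sketched as a Quicksort-style selection that stops as soon as a bisection point falls into the goal interval. The sketch makes some choices that the text leaves open: it uses a three-way partition so that duplicate values are handled, it takes the median of the first, middle and last element as pivot, and it works iteratively instead of recursively. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def bipartition(data: List[Sequence[float]], dim: int,
                n_min_left: int, n_max_left: int) -> int:
    """Reorder data in place and return m with n_min_left <= m <= n_max_left
    such that no value of data[:m] exceeds a value of data[m:] in dimension dim."""
    n = len(data)
    n_min_left, n_max_left = max(0, n_min_left), min(n, n_max_left)
    if n_min_left > n_max_left:
        raise ValueError("empty goal interval")
    lo, hi = 0, n  # active range; everything left of lo is lower, right of hi higher
    while True:
        if n_min_left <= lo <= n_max_left:
            return lo
        if n_min_left <= hi <= n_max_left:
            return hi
        keys = (data[lo][dim], data[(lo + hi) // 2][dim], data[hi - 1][dim])
        pivot = sorted(keys)[1]  # middle of three values
        lt, i, gt = lo, lo, hi
        while i < gt:  # three-way partition of the active range
            key = data[i][dim]
            if key < pivot:
                data[lt], data[i] = data[i], data[lt]
                lt += 1
                i += 1
            elif key > pivot:
                gt -= 1
                data[i], data[gt] = data[gt], data[i]
            else:
                i += 1
        # Every position in lt..gt is a valid bisection point.
        if n_max_left < lt:
            hi = lt  # continue only in the lower part
        elif n_min_left > gt:
            lo = gt  # continue only in the higher part
        else:
            return max(lt, n_min_left)


points = [((i * 37) % 101, (i * 53) % 89) for i in range(200)]
original = list(points)
m = bipartition(points, dim=1, n_min_left=90, n_max_left=110)
```

Because only the part that contains the goal interval is processed further, the expected work shrinks geometrically from run to run, which is the intuition behind the linear average-case cost.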

The external mode is more sophisticated: First, the pivot value is determined from the
sample which is taken in the same way as described in section 3.2 and can often be
reused. A complete internal bipartition runs on the sample data set to determine a
suitable pivot value. In the following external bisection (cf. figure 3), transfers from and to
the cache are always processed with a blocksize of half the cache size. Figure 3a shows
the initialization of the cache from the first and last block in the disk file. Then, the data
in the cache is processed by internal bisection with respect to the pivot value. If the
bisection point is in the lower part of the cache (figure 3c), the right side contains more
objects than fit into one block. One block, starting from the bisection point, is written
back to the file and the next block is read and internally bisected again. Usually, objects
remain in the lower and higher ends of the cache. These objects are used later to fill up
transfer blocks completely. All remaining data is written back in the very last step into
the middle of the file where additionally a fraction of a block has to be processed.
Finally, we test if the bisection point of the external bisection is in the split interval. If the
point is outside, another recursion is required.

[Figure 3 panels: (a) Initializing the cache from file; (b) Internal bisection of the cache;
(c) Writing the larger half partially back to disk; (d) Loading one further block to cache;
(e) Writing the larger half partially back to disk]

Figure 3: External Bisection.

3.5 Constructing the Index Directory 

As data partitioning is done by a recursive algorithm, the structure of the index is
represented by the recursion tree. Therefore, we are able to create a directory node after the
completion of the recursive calls for the child nodes. These recursive calls return the bounding
boxes and the corresponding secondary storage addresses to the caller, where the
information is collected. There, the directory node is written, the bounding boxes are combined to
a single bounding box comprising all boxes of child nodes, and the result is again
propagated to the next higher level. A depth-first post-order sequentialization of the index is
written to the disk.
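Combining the child boxes into the box of the parent can be sketched as below. The representation of a box as a pair of corner tuples and the function names are assumptions of this illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def bounding_box(points: List[Sequence[float]]) -> Box:
    """Minimum bounding box of the points stored in one data page."""
    lower = tuple(min(coords) for coords in zip(*points))
    upper = tuple(max(coords) for coords in zip(*points))
    return lower, upper


def combine_boxes(boxes: List[Box]) -> Box:
    """Bounding box of a directory node: the union of its child boxes."""
    lower = tuple(min(coords) for coords in zip(*(b[0] for b in boxes)))
    upper = tuple(max(coords) for coords in zip(*(b[1] for b in boxes)))
    return lower, upper


child_a = bounding_box([(0.0, 5.0), (2.0, 3.0)])
child_b = bounding_box([(4.0, 1.0), (6.0, 2.0)])
parent = combine_boxes([child_a, child_b])
```

Each recursive call would return such a box together with the storage address of its page, so that the caller can assemble the directory entries after all children are finished.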



3.6 Analytical Evaluation of the Construction Algorithm 

In this section, we will show that our bottom-up construction algorithm has an average case
time complexity of O(n log n). Moreover, we will consider disk accesses in a more exact
way, and thus provide an analytically derived improvement factor over the dynamic index
construction. For the file I/O, we determine two parameters: the number of random seek
operations and the amount of data read or written from or to the disk. Provided that no further
caching is performed (which is true for our application, but cannot be guaranteed for the
operating system) and that seeks are uniformly distributed variables, the I/O
processing time can be determined as

$$t_{\text{IO}} = t_{\text{seek}} \cdot \text{seek\_ops} + t_{\text{xfer}} \cdot \text{amount} .$$

In the following, we denote by the cache capacity C_cache the number of objects fitting into
the cache:

$$C_{\text{cache}} = \frac{\text{cachesize}}{\text{sizeof(object)}}$$

Lemma 1. Complexity of bisection

The bisection algorithm has the complexity O(n).

Proof (Lemma 1) 

We assume that the pivot element is randomly chosen from the data set. After the first run
of the algorithm, the pivot element is located with uniform probability at one of the n
positions in the file. Therefore, the next run of the algorithm will have the length k with a
probability 1/n for each 1 ≤ k ≤ n. Thus, the cost function C(n) encompasses the cost
for the algorithm, n + 1 comparison operations, plus a probability-weighted sum of the
cost for processing the algorithm with length k - 1, C(k - 1). We obtain the following
recursive equation:

$$C(n) = n + 1 + \sum_{k=1}^{n} \frac{C(k-1)}{n}$$

which can be solved by multiplying with n and subtracting the same equation for n - 1.
This can be simplified to C(n) = 2 + C(n - 1), and C(n) = 2 · n = O(n). □
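The closed form can be checked numerically by evaluating the recursive equation with exact rational arithmetic. The boundary value C(0) = 0 is an assumption of this sketch (an empty run costs nothing); the function name is hypothetical.

```python
from fractions import Fraction
from typing import List


def bisection_cost(n_max: int) -> List[Fraction]:
    """Evaluate C(n) = n + 1 + (1/n) * sum_{k=1..n} C(k - 1) for n = 0..n_max."""
    cost = [Fraction(0)]  # assumed boundary value C(0) = 0
    prefix = Fraction(0)  # running sum C(0) + ... + C(n - 1)
    for n in range(1, n_max + 1):
        prefix += cost[n - 1]
        cost.append(n + 1 + prefix / n)
    return cost


costs = bisection_cost(200)
```

Every value produced this way equals 2 · n exactly, in agreement with the simplification C(n) = 2 + C(n - 1).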



Lemma 2. Cost Bounds of Recursion 

(1) The amount of data read or written during one recursion of our technique does not 
exceed four times the file-size. 

(2) The number of seek operations required is bounded by

$$\text{seek\_ops}(n) \le \frac{8 \cdot n}{C_{\text{cache}}} + 2 \cdot \log_2(n)$$

Proof (Lemma 2) 

(1) follows directly from Lemma 1 because every compared element has to be transferred 
at most once from disk to main memory and at most once back to disk. 

(2) In each run of the external bisection algorithm, file I/O is processed with a blocksize
of cachesize/2. The number of blocks read in each run is therefore

$$\text{blocks\_read}_{\text{bisection}}(n) = \frac{n}{C_{\text{cache}} / 2} + 1$$

because one extra read is required in the final step. The number of write operations is the
same and thus

$$\text{seek\_ops}(n) = 2 \cdot \sum_{i=0}^{r_{\text{interval}}} \text{blocks\_read}_{\text{bisection}}(i) \le \frac{8 \cdot n}{C_{\text{cache}}} + 2 \cdot \log_2(n) . \qquad \Box$$



Lemma 3. Average Case Complexity of Our Technique 

Our technique has an average case complexity of O(n log n) unless the split strategy has
a complexity worse than O(n).

Proof (Lemma 3)

For each level of the tree, the complete data set has to be bisected as often as the height
of the split tree indicates. As the height of the split tree is determined by the directory
page capacity, there are at most

$$h(n) \cdot C_{\text{max,dir}} = O(\log n)$$

bisection runs necessary. Therefore, our technique has the complexity O(n log n).

Figure 4: Improvement Factor for the Index Construction According to Lemmata 1-5.



Lemma 4. Cost of Symmetric Partitioning 

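For symmetric splitting, the procedure partition( ) handles an amount of file I/O data of

$$ \left( \log_2\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) + \log_{C_{\mathrm{max,dir}}}\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) \right) \cdot 4 \cdot \mathrm{filesize} $$

and requires

$$ \left( \log_2\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) + \log_{C_{\mathrm{max,dir}}}\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) \right) \cdot \left( \frac{8 \cdot n}{C_{\mathrm{cache}}} + 2 \cdot \log_2(n) \right) $$

random seek operations.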

Proof (Lemma 4) 



Left out due to space limitations, cf. [Boh 98]. 



Lemma 5. Cost of Dynamic Index Construction 

Dynamic X-tree construction requires 2 · n seek operations. The transferred amount of
data is 2 · n · pagesize.

Proof (Lemma 5) 

For the X-tree, it is generally assumed that the directory is completely held in main 
memory. Data pages are not cached at all. For each insert, the corresponding data page 
has to be loaded and written back after completing the operation. 

□ 



Moreover, no better caching strategy for data pages can be applied, since without prepro- 
cessing of the input data set, no locality can be exploited to establish a working set of pages. 
From the results of lemmata 4 and 5 we can derive an estimate for the improvement factor 
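of the bottom-up construction over dynamic index construction. The improvement factor
for the number of seek operations is approximately:

$$ \mathrm{Improvement} \approx \frac{C_{\mathrm{cache}}}{4 \cdot \left( \log_2\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) + \log_{C_{\mathrm{max,dir}}}\!\left(\frac{\ldots}{C_{\mathrm{cache}}}\right) \right)} $$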

It is almost (up to the logarithmic factor in the denominator) linear in the cache capacity. 
Figure 4 depicts the improvement factor (number of random seek operations) for varying 
cache sizes and varying database sizes. 
4. Experimental Evaluation

To show the practical relevance of our bottom-up construction algorithm, we have per- 
formed an extensive experimental evaluation by comparing the following index construc- 
tion techniques: Dynamic index construction (repeated insert operations), Hilbert R-tree 
construction and our new method. All experiments have been computed on HP9000/780 
workstations with several GBytes of secondary storage. Although our technique is applica- 
ble to most R-tree-like index structures, we decided to use the X-tree as an underlying 
index structure because according to [BKK 96], the X-tree outperforms other
high-dimensional index structures. All programs have been implemented in C++.

In our experiments, we compare the construction times for various indexes. The exter- 
nal sorting procedure of our construction method was allowed to use only a relatively 
small cache (32 kBytes). Note that, although our implementation does not provide any 
further disk I/O caching, this cannot be guaranteed for the operating system. In contrast, 
the Hilbert construction method was implemented with internal sorting for simplicity. 
The construction time of the Hilbert method is therefore underestimated by far and 
would worsen in combination with external sorting when the cache size is strictly limit- 
ed. All Hilbert-constructed indexes have a storage utilization near 100%. 

Figure 5 shows the construction time of dynamic index construction and of the bottom-up
methods. In the left diagram, we fix the dimension to 16, and vary the database size from
100,000 to 2,000,000 objects of synthetic data. The resulting speed-up of the bulk-loading
techniques over the dynamic construction was so enormous that a logarithmic scale must be
used in figure 5. In contrast, the bottom-up methods differ only slightly in their
performance. The Hilbert technique was the best method, having a construction time between
17 and 429 sec. The construction time of symmetric splitting ranges from 26 to 668 sec.,
whereas unbalanced splitting required between 21 and 744 sec. in the moderate case and
between 23 and 858 sec. for the 9:1 split. In contrast, the dynamic construction time ranged
from 965 to 393,310 sec. (4 days, 13 hours). The improvement factor of our methods
constantly increases with growing index size, starting from 37 to 45 for 100,000 objects and
reaching 458 to 588 for 2,000,000 objects. The Hilbert construction is up to 915 times
faster than the dynamic index construction. This enormous factor is not only due to internal
sorting but also due to reduced overhead in changing the ordering attribute. In contrast to
Hilbert construction, our technique changes the sorting criterion during the sort process
according to the split tree. The more often the sorting criterion is changed, the more
unbalanced the split becomes because the height of the split tree increases. Therefore, the
9:1 split has the worst improvement factor. The right diagram in figure 5 shows the
construction time for varying index dimensions. Here, the database size was fixed to
1,000,000 objects. It can be seen that the improvement factors of the construction methods
(between 240 and 320) are rather independent of the dimension of the data space.

Figure 5: Performance of Index Construction Against Database Size and Dimension.
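The improvement factors quoted above can be recomputed from the reported construction times. The sketch below assumes that the lower end of each reported range belongs to the 100,000-object database and the upper end to the 2,000,000-object database; the text gives only the ranges, so this pairing is an assumption, and the dictionary keys are illustrative labels.

```python
# Construction times in seconds as quoted in the text; the pairing
# (100,000 objects, 2,000,000 objects) is an assumption of this sketch.
DYNAMIC = (965, 393_310)
BULK = {
    "Hilbert": (17, 429),
    "symmetric split": (26, 668),
    "unbalanced split (moderate)": (21, 744),
    "unbalanced split (9:1)": (23, 858),
}

def improvement_factors(dynamic, bulk):
    """Divide the dynamic construction time by each bulk-loading time."""
    return {name: tuple(d / b for d, b in zip(dynamic, times))
            for name, times in bulk.items()}

factors = improvement_factors(DYNAMIC, BULK)
for name, (small, large) in factors.items():
    print(f"{name:28s} {small:7.1f} {large:7.1f}")
```

Truncated to integers, the factors of the three split variants fall into the quoted ranges of 37 to 45 and 458 to 588. The Hilbert factor for the large database comes out slightly above the quoted 915, presumably because the reported times are rounded.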

Our further experiments, which are not presented due to space limitations [Boh 98], 
show that the Hilbert construction method yields a bad performance in query process- 
ing. The reason is the high overlap among the page regions. Due to improved space 
partitioning resulting from knowing the data set a priori, the indexes constructed by our 
new method outperform even the dynamically constructed indexes by factors up to 16.8. 

5. Conclusion 

In this paper, we have proposed a fast algorithm for constructing indexes for high-dimensional
data spaces on secondary storage. A user-defined split strategy allows the adaptation of the
index properties to the requirements of a specific knowledge discovery algorithm. We have
shown both analytically and experimentally that our construction method outperforms the
dynamic index construction by large factors. Our experiments further show that these indexes
are also superior with respect to the search performance. Future work includes the
investigation of various split strategies and their impact on different query types and
access patterns.
Abstract. Efficient query processing is one of the basic needs for data mining
algorithms. Clustering algorithms, association rule mining algorithms and OLAP 
tools all rely on efficient query processors being able to deal with high-dimension- 
al data. Inside such a query processor, multidimensional index structures are used 
as a basic technique. As the implementation of such index structures is a difficult
and time-consuming task, we propose a new approach to implement an index
structure on top of a commercial relational database system. In particular, we map 
the index structure to a relational database design and simulate the behavior of the 
index structure using triggers and stored procedures. This can easily be done for a 
very large class of multidimensional index structures. To demonstrate the feasibil- 
ity and efficiency, we implemented an X-tree on top of Oracle 8. We ran several 
experiments on large databases and recorded a performance improvement of up to 
a factor of 11.5 compared to a sequential scan of the database. 

1. Introduction 

Efficient query processing in high-dimensional data spaces is an important require- 
ment for many data analysis tools. Algorithms for knowledge discovery tasks such as 
clustering [EKSX 98], association rule mining [AS 94], or OLAP [HAMS 97], are often 
based on range search or nearest neighbor search in multidimensional feature spaces. 
Since these applications deal with large amounts of usually high-dimensional point data, 
multidimensional index structures must be applied for the data management in order to 
achieve a satisfactory performance. 

Multidimensional index structures have been intensively investigated during the last
decade. Most of the approaches [Gut 84, LS 89] were designed in the context of geographical
information systems where two-dimensional data spaces are prevalent. The performance of
query processing often deteriorates when the dimensionality increases. To overcome this
problem, several specialized index structures for high-dimensional query processing have
been proposed that fall into two general categories: One can either solve the d-dimensional
problem by designing a d-dimensional index. Examples are the TV-tree [LJF 95], the SS-tree
[WJ 96], the SR-tree [KS 97] or the X-tree [BKK 96]. We refer to this class of indexing
techniques as multidimensional indexes. Alternatively, one can map the d-dimensional problem
to an equivalent 1-dimensional problem and then make use of an existing 1-dimensional index
such as a B+-tree. Thus, we provide a mapping that maps each d-dimensional data point into a
1-dimensional value (key). We refer to this class of indexing techniques as mapping
techniques. Examples for this category are the Z-order [FB 74], the Hilbert-curve
[FR 89, Jag 90], Gray-Codes [Fal 85], or the Pyramid-tree [BBK 98]. We refer to [Boh 98] for
a comprehensive survey on the relevant techniques.
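As an illustration of a mapping technique, the following sketch computes a Z-order key by bit interleaving. It assumes that the data space is the unit cube [0, 1)^d and that each coordinate is first quantized to a fixed number of bits; the function names, the bit ordering (dimension 0 first, most significant bit first) and the sample point are choices of this sketch, not details taken from the cited papers.

```python
def quantize(value, bits):
    """Map a float in [0, 1) to an integer grid cell with 2**bits cells per axis."""
    return min(int(value * (1 << bits)), (1 << bits) - 1)

def z_order_key(cells, bits):
    """Interleave the bits of non-negative integer coordinates into one key."""
    key = 0
    for bit in range(bits - 1, -1, -1):   # most significant bit first
        for cell in cells:                # dimension 0 contributes first
            key = (key << 1) | ((cell >> bit) & 1)
    return key

point = (0.75, 0.10, 0.50)                # hypothetical 3-dimensional data point
cells = tuple(quantize(v, 4) for v in point)
print(cells, z_order_key(cells, 4))
```

The resulting integer can be stored in an ordinary column and indexed by a 1-dimensional index such as a B+-tree.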

Recently, there has been increasing interest in integrating high-dimensional point data into
commercial database management systems. Data to be analyzed often stem from pro- 
ductive environments which are already based on relational database management sys- 
tems. These systems provide efficient data management for standard transactions such 
as billing and accounting as well as powerful and adequate tools for reports, spread- 
sheets, charts and other simple visualization and presentation tools. Relational databas- 
es, however, fail to manage high-dimensional point data efficiently for advanced knowl- 
edge discovery algorithms. Therefore, it is common to store productive data in a 
relational database system and to replicate the data for analysis purposes outside the 
database in file-based multidimensional index structures. We call this approach the hy- 
brid solution. 

The hybrid solution bears various disadvantages. Especially the integrity of data 
stored in two ways, inside and outside the database system, is difficult to maintain. If an 
update operation involving both multidimensional and productive data fails in the
relational database (e.g. due to concurrency conflicts), the corresponding update in the
multidimensional index must be undone to guarantee consistency. Vice versa, if the
multidimensional update fails, the corresponding update to the relational database must be
aborted. For this purpose, a two-phase commit protocol for heterogeneous database sys- 
tems must be implemented, a time-consuming task which requires a deep knowledge of 
the participating systems. The hybrid solution involves further problems. File systems 
and database systems usually have different concepts for data security, backup and con- 
current access. File-based storage does not guarantee physical and logical data indepen- 
dence. Thus, schema evolution in “running” applications is difficult. 

A promising approach to overcome these disadvantages is based on object-relational 
database systems. Object-relational database systems are relational database systems 
which can be extended by application-specific data types (called data cartridges or data 
blades). The general idea is to define data cartridges for multidimensional attributes and 
to manage them in the database. For data-intensive applications it is necessary to imple- 
ment multidimensional index structures in the database. This requires the access to the 
block-manager of the database system, which is not granted by most commercial data- 
base systems. The current universal server by ORACLE, for instance, does not provide 
any documentation of a block-oriented interface to the database. Data cartridges are 
only allowed to access relations via the SQL interface. Current object-relational data- 
base systems are thus not very helpful for our integration problem. 

We can summarize that using current object-relational database systems or pure rela- 
tional database systems, the only possible way to store multidimensional attributes in- 
side the database is to map them into the relational model. 

In this paper, we propose a technique which allows a direct mapping of the concepts
of specialized index structures for high-dimensional data spaces into the relational
model. For concreteness, we concentrate here on a relational implementation of the X-tree
on top of Oracle 8. The X-tree, an R-tree variant for high-dimensional data spaces, is
described in detail in section 4.1. The presented techniques, however, can also be
applied to other indexing approaches such as the TV-Tree [LJF 95] or the SS-Tree
[WJ 96]. Similarly, the underlying database system can be exchanged using the same
concept we suggest. The general idea is to model the structure of the relevant
components of the index (such as data pages, data items, directory pages etc.) in the
relational model and to simulate the query processing algorithms defined on these
structures using corresponding SQL statements.

The simulation of mapping techniques is straightforward and is therefore not
explained in depth in this paper. One just stores the 1-dimensional value in an additional
column of the data table and then searches this column. Obviously, a database index is
used to support the search. Thus, the whole query process is done in three steps:

1. compute a set of candidates based on the 1-dimensional key

2. refine this set of candidates based on the d-dimensional feature vectors 

3. refine this set of candidates by looking up the actual data items 
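The three steps can be sketched with any relational engine. The example below uses SQLite through Python's `sqlite3` module as a stand-in for a commercial system and a Z-order key as the mapping; the table names `points` and `items`, the column `zkey` and the sample data are hypothetical. It relies on the fact that bit interleaving is monotone in every coordinate, so all points inside a query box have keys between the keys of the box's lower and upper corner.

```python
import sqlite3

def z_key(cells, bits):
    """Z-order key of integer coordinates (bit interleaving, dimension 0 first)."""
    key = 0
    for bit in range(bits - 1, -1, -1):
        for cell in cells:
            key = (key << 1) | ((cell >> bit) & 1)
    return key

BITS = 4
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE points (tid INTEGER PRIMARY KEY, x0 INTEGER, x1 INTEGER, zkey INTEGER)")
con.execute("CREATE INDEX points_zkey ON points (zkey)")   # the supporting database index
con.execute("CREATE TABLE items (tid INTEGER PRIMARY KEY, payload TEXT)")

for tid, x0, x1 in [(1, 2, 3), (2, 5, 5), (3, 9, 1), (4, 6, 7), (5, 12, 12)]:
    con.execute("INSERT INTO points VALUES (?, ?, ?, ?)", (tid, x0, x1, z_key((x0, x1), BITS)))
    con.execute("INSERT INTO items VALUES (?, ?)", (tid, f"item-{tid}"))

def range_query(lo, hi):
    """Return the data items whose feature vectors lie in the box [lo, hi]."""
    return con.execute(
        """
        SELECT i.tid, i.payload
        FROM points p JOIN items i ON i.tid = p.tid   -- step 3: look up the data items
        WHERE p.zkey BETWEEN ? AND ?                  -- step 1: candidates by the key
          AND p.x0 BETWEEN ? AND ?                    -- step 2: refine by feature vector
          AND p.x1 BETWEEN ? AND ?
        ORDER BY i.tid
        """,
        (z_key(lo, BITS), z_key(hi, BITS), lo[0], hi[0], lo[1], hi[1]),
    ).fetchall()

print(range_query((4, 4), (7, 8)))
```

For the sample data, only the points (5, 5) and (6, 7) lie in the query box, so the items with tid 2 and 4 are returned. A single key interval is the simplest possible candidate filter; it may admit many false candidates, which the second step removes.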

2. Simulation of Hierarchical Index Structures 

The implementation of hierarchical index structures is much more complex than the 
implementation of mapping techniques. This applies to any implementation strategy. 
The reason for this is that hierarchical index structures have a complex structure that 
dynamically changes when inserting new data items. Thus, algorithms do not run on a 
previously given structure and have to be implemented recursively. To demonstrate that 
even in this complex scenario, an implementation of an index structure on top of a 
commercial database system can be done relatively easily and is preferable compared to
a legacy implementation, we implemented the X-tree, a high-dimensional index struc- 
ture, based on R-trees. 

2.1 Simulation 

The basic idea of our technique is to simulate the X-tree within the relational schema. 
Thus, we keep a separate table for each level of the tree. One of those tables stores the 
data points (simulating the data pages); the other tables store minimum bounding boxes
and pointers (simulating the directory pages). Figure 1 depicts this scenario. In order to 
insert a data item, we first determine the data page in which the item has to be inserted. 
Then, we check whether the data page overflows and if it does, we split the page accord- 
ing to the X-tree split strategy. Note that a split might also cause the parent page in the 
directory to overflow. If we have to split the root node of the tree which causes the tree 
to grow in height, we have to introduce an additional table¹ and thus change the schema.
A practical alternative is to pre-define tables for a three or four level directory. As only 
in case of very large databases, an X-tree grows beyond height four, by doing so we can 
handle a split of the root node as an exception that has to be handled separately. Thus, 
the schema of the tree becomes static. All these actions are implemented in stored pro- 
cedures. 

In order to search the tree, we have to join all tables and generate a single SQL statement
that queries the entire tree. This statement has to be created dynamically whenever
the schema of the X-tree changes due to tree growth. If we process range queries, the
SQL statement is rather simple. The details are provided in section 2.3.

1. A technical problem arises here when dealing with commercial database systems: Oracle 8, for
instance, ends a transaction whenever a DDL command is executed. This means that if we use
Oracle 8, an insert operation on a tree that caused the root node to be split cannot be undone
by simply aborting the current transaction.

Figure 1: Relational Schema (Including B+-Tree Indexes) of the X-Tree.

Relational Schema 

All information usually held in the data pages of the X-tree is modeled in a relation
called DATA. A tuple in DATA contains a d-dimensional data vector, which is held in a
set of d numerical attributes x_0, x_1, ..., x_{d-1}, a unique tuple identifier (tid), and the
page number (pn) of the data page. Thus, DATA has the schema “DATA (x_0 FLOAT, x_1
FLOAT, ..., x_{d-1} FLOAT, tid NUMBER NOT NULL, pn NUMBER NOT NULL)”.
Intuitively, all data items located in the same data page of the X-Tree share the same value
pn.

The k levels of the X-tree directory are modeled using k relations DIRECTORY_0, ...,
DIRECTORY_{k-1}. Each tuple in a relation DIRECTORY_i belongs to one entry of a
directory node in level i consisting of a bounding box and a pointer to a child node. Therefore,
DIRECTORY_i is of the scheme “DIRECTORY_i (lb_0 FLOAT, ub_0 FLOAT, ..., lb_{d-1}
FLOAT, ub_{d-1} FLOAT, child NUMBER NOT NULL, pn NUMBER NOT NULL)”. The
additional attribute child represents the pointer to the child node which, in case of
DIRECTORY_{k-1}, references a data page, and pn identifies the directory node the entry
belongs to. Thus, the two relations DIRECTORY_{k-1} and DATA can be joined via the
attributes child and pn which actually form a 1:n-relationship between DIRECTORY_{k-1}
and DATA. The same relationship exists for two subsequent directory levels
DIRECTORY_i and DIRECTORY_{i+1}. Obviously, it is important to make the join between two
subsequent levels of the directory efficient. To facilitate index-based join methods, we
create indexes using the pn attribute as the ordering criterion. The same observation holds
for the join between DIRECTORY_{k-1} and DATA. To save table accesses, we also added
the quantized version of the feature vectors to the index. The resulting relational schema
of the X-Tree enhanced by the required indexes (triangles) is depicted in Figure 1.
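The schema described above can be generated for arbitrary d and k. The following sketch uses SQLite through Python's `sqlite3` module instead of Oracle 8, so the column types (REAL, INTEGER) and the index names are adapted assumptions; the helper `create_xtree_schema` is hypothetical. The quantized vectors that the paper adds to the DATA index are left out here.

```python
import sqlite3

def create_xtree_schema(con, d, k):
    """Create DATA and DIRECTORY_0 .. DIRECTORY_{k-1} with indexes on pn."""
    coords = ", ".join(f"x{i} REAL" for i in range(d))
    con.execute(f"CREATE TABLE data ({coords}, tid INTEGER NOT NULL, pn INTEGER NOT NULL)")
    con.execute("CREATE INDEX data_pn ON data (pn)")
    boxes = ", ".join(f"lb{i} REAL, ub{i} REAL" for i in range(d))
    for level in range(k):
        con.execute(
            f"CREATE TABLE directory_{level} "
            f"({boxes}, child INTEGER NOT NULL, pn INTEGER NOT NULL)"
        )
        # The index on pn supports the join child = pn with the level above.
        con.execute(f"CREATE INDEX directory_{level}_pn ON directory_{level} (pn)")

con = sqlite3.connect(":memory:")
create_xtree_schema(con, d=3, k=2)
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Pre-defining the tables for a fixed k in this way corresponds to the static-schema alternative mentioned in section 2.1.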

Compressed Attributes 

If we assume a high-dimensional data space, the location of a point in this space is 
defined in terms of d floating point values. If d increases, the amount of information, we 
are keeping, also increases linearly. Intuitively however, it should be possible to keep the 
amount of information stored for a single data item almost constant for any dimension. 
An obvious way to achieve this is to reduce the number of bits used for storing a single 
coordinate linearly if the number of coordinates increases. In other words, as in a high- 
dimensional space we have so much information about the location of a point, it should 
be sufficient to use a coarser resolution to represent the data space. This technique
has been applied successfully in the VA-file [WSB 98] to compute nearest neighbors. In the
VA-file, a compressed version of the data points is stored in one file and the exact data
is stored in another file. Both files are unsorted; however, the ordering of the points in
the two files is identical. Query processing is equivalent to a sequential scan of the
compressed file with some look-ups to the second file whenever this is necessary. In
particular, a look-up occurs if a point cannot be pruned from the nearest neighbor search
based on the compressed representation alone.

In our implementation of the X-tree, we suggest the similar technique of compressed
attributes. A compressed attribute summarizes the d-dimensional information of an entry
in the DATA table in a single-value representation. Thus, the resolution of the data
space is reduced to 1 byte per coordinate. Then, the 1-byte coordinates are concatenated
and stored in a single attribute called comp. Thus, the scheme of DATA changes to DATA
(REAL x_0, REAL x_1, ..., REAL x_{d-1}, RAW[d] comp, INT tid, INT pn). To guarantee
efficient access to the compressed attributes, we store comp in the index assigned to
DATA. Thus, in order to exclude a data item from the search, we can first use the
compressed representation of the data item stored in the index, and only if this is not
sufficient do we have to look up the actual DATA table. This further reduces the
number of accesses to the DATA table because most accesses are only to the index.
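A compressed attribute of this kind could be computed as follows. This is a minimal sketch, assuming a data space [0, 1)^d and uniform quantization into 256 cells per axis; the paper does not specify the quantization function, and the names `compress` and `may_qualify` are illustrative.

```python
def compress(vector, lower=0.0, upper=1.0):
    """Quantize each coordinate to one byte and concatenate the bytes."""
    scale = 256.0 / (upper - lower)
    return bytes(min(255, max(0, int((v - lower) * scale))) for v in vector)

def may_qualify(comp, q_lower, q_upper, lower=0.0, upper=1.0):
    """Filter on the compressed attribute only.

    Returns False only if the point is certainly outside the query window,
    because the quantization is monotone in every coordinate.
    """
    q_lo = compress(q_lower, lower, upper)
    q_hi = compress(q_upper, lower, upper)
    return all(lo <= c <= hi for c, lo, hi in zip(comp, q_lo, q_hi))

comp = compress((0.10, 0.50, 0.99))       # hypothetical 3-dimensional point
print(list(comp))
print(may_qualify(comp, (0.0, 0.4, 0.9), (0.2, 0.6, 1.0)))
```

Points that pass this filter are only candidates; the exact coordinates still have to be checked, which corresponds to the look-up in the DATA table described above.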

2.2 Index Creation 

There are two situations in which one intends to insert new data into an index structure:
inserting a single data item, and building an index from scratch given a large amount of
data (bulk-load). For efficiency reasons, these two situations should be handled separately.
The reason is that a dynamic insert of a single data item is usually relatively slow;
however, knowing all the data items to be inserted in advance, we are able to preprocess
the data (e.g. sort) such that an index can be built very efficiently. This applies to
almost all multidimensional index structures and their implementations.

The dynamic insertion of a single data item involves two steps: determining an inser- 
tion path and, when necessary, a local restructuring of the tree. There are basically two 
alternatives for the implementation: an implementation of the whole insert algorithm
(e.g. using embedded SQL), or directly inserting the data point into the DATA relation
and then raising triggers which perform the restructuring operation.

In any implementation, we first have to determine an appropriate data page to insert 
the data item. To this end, we recursively look up the directory tables as we would handle
it in a legacy implementation. Using a stored procedure, we load all affected node entries 
into main memory and process them as described above. Then, we insert the data item 
into the page. In case of an overflow, we recursively update the directory, according to 
[BKK 96]. 

If an X-Tree has to be created from scratch for a large data set, it is more efficient to 
provide a bulk-load operation, such as proposed in [BBK 98a]. This technique can also 
be implemented in embedded SQL or stored procedures. 

2.3 Processing Range Queries 

Processing a range query using our X-Tree implementation with a k-level directory
involves (k + 2) steps. The first step reads the root level of the directory
(DIRECTORY_0) and determines all pages of the next deeper level (DIRECTORY_1)
which are intersected by the query window. These pages are loaded in the second step
and used for determining the qualifying pages in the subsequent level. The following
steps read all k levels of the directory in the same way, thus filtering between pages
which are affected or not. Once the bottom level of the directory has been processed, the
page numbers of all qualifying data pages are known. The data pages in our implementation
contain the compressed (i.e. quantized) versions of the data vectors. Step number
(k + 1), the last filter step, loads these data pages and determines candidates (a candidate
is a point whose quantized approximation is intersected by the query window). In the
refinement step (k + 2), the candidates are directly accessed (the position in the data file
is known) and tested for containment in the query window.

    SELECT data.*
    FROM   directory_0 dir_0, directory_1 dir_1, data
    WHERE
      /* JOIN */
          dir_0.child = dir_1.pn
      AND dir_1.child = data.pn
      /* 1st step */
      AND dir_0.lb_0 <= qub_0 AND qlb_0 <= dir_0.ub_0
      AND ...
      AND dir_0.lb_{d-1} <= qub_{d-1} AND qlb_{d-1} <= dir_0.ub_{d-1}
      /* 2nd step */
      AND dir_1.lb_0 <= qub_0 AND qlb_0 <= dir_1.ub_0
      AND ...
      AND dir_1.lb_{d-1} <= qub_{d-1} AND qlb_{d-1} <= dir_1.ub_{d-1}
      /* 3rd step */
      AND ASCII(SUBSTR(data.comp,1,1)) BETWEEN qclb_0 AND qcub_0
      AND ...
      AND ASCII(SUBSTR(data.comp,d,1)) BETWEEN qclb_{d-1} AND qcub_{d-1}
      /* 4th step */
      AND data.x_0 BETWEEN qlb_0 AND qub_0
      AND ...
      AND data.x_{d-1} BETWEEN qlb_{d-1} AND qub_{d-1}

Figure 2: An Example for an SQL Statement Processing a Range Query.

In our relational implementation, all these steps are comprised in a single SQL statement
(cf. Figure 2 for a 2-level directory). It forms an equi-join between each pair of
subsequent directory levels (DIRECTORY_j and DIRECTORY_{j+1}, 0 ≤ j ≤ k − 2) and an
additional equi-join between the last directory level DIRECTORY_{k-1} and the DATA
relation. It consists of (k + 2) AND-connected blocks in the WHERE-clause. The blocks
refer to the steps of range query processing as described above. For example, the first
block filters all page numbers of the second directory level qualifying for the query.
Block number (k + 1) contains various substring-operations. The reason is that we had
to pack the compressed attributes into a string due to restrictions on the number of
attributes which can be stored in an index. The last block forms the refinement step.
Note that it is important to translate the query into a single SQL statement, because
client/server communication involving costly context switches or round-trip delays can
be clearly reduced.
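Because the statement depends on d and k, it has to be generated. The sketch below builds the (k + 2) blocks and runs them on a tiny hand-made tree. It uses SQLite through Python's `sqlite3` module rather than Oracle 8, and it packs the compressed attribute as fixed-width hexadecimal text (two characters per coordinate) so that plain substring comparisons suffice; both are assumptions of this sketch, as are all function names, parameter names and sample values.

```python
import sqlite3

def quantize_hex(vector):
    """One byte per coordinate of a point in [0, 1)^d, packed as fixed-width hex text."""
    return "".join(f"{min(255, max(0, int(v * 256))):02x}" for v in vector)

def build_range_query(d, k):
    """Build the single statement; parameters are named :qlb0, :qub0, :qclb0, :qcub0, ..."""
    tables = [f"directory_{j} dir_{j}" for j in range(k)] + ["data"]
    blocks = [f"dir_{j}.child = dir_{j + 1}.pn" for j in range(k - 1)]   # equi-joins
    blocks.append(f"dir_{k - 1}.child = data.pn")
    for j in range(k):                                    # steps 1..k: directory filter
        blocks += [f"dir_{j}.lb{i} <= :qub{i} AND :qlb{i} <= dir_{j}.ub{i}" for i in range(d)]
    blocks += [f"substr(data.comp, {2 * i + 1}, 2) BETWEEN :qclb{i} AND :qcub{i}"
               for i in range(d)]                         # step k+1: compressed filter
    blocks += [f"data.x{i} BETWEEN :qlb{i} AND :qub{i}" for i in range(d)]   # step k+2
    return ("SELECT data.tid FROM " + ", ".join(tables)
            + " WHERE " + "\n  AND ".join(blocks) + " ORDER BY data.tid")

def demo():
    d, k = 2, 2
    con = sqlite3.connect(":memory:")
    box = ", ".join(f"lb{i} REAL, ub{i} REAL" for i in range(d))
    for j in range(k):
        con.execute(f"CREATE TABLE directory_{j} ({box}, child INTEGER, pn INTEGER)")
    con.execute("CREATE TABLE data (x0 REAL, x1 REAL, comp TEXT, tid INTEGER, pn INTEGER)")
    # Root page 0 has two entries; each points to one directory page with one data page.
    con.executemany("INSERT INTO directory_0 VALUES (?, ?, ?, ?, ?, ?)",
                    [(0.0, 0.5, 0.0, 1.0, 1, 0), (0.5, 1.0, 0.0, 1.0, 2, 0)])
    con.executemany("INSERT INTO directory_1 VALUES (?, ?, ?, ?, ?, ?)",
                    [(0.0, 0.5, 0.0, 1.0, 10, 1), (0.5, 1.0, 0.0, 1.0, 20, 2)])
    points = [(1, (0.1, 0.2), 10), (2, (0.4, 0.9), 10), (3, (0.6, 0.3), 20), (4, (0.9, 0.8), 20)]
    con.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)",
                    [(p[0], p[1], quantize_hex(p), tid, pn) for tid, p, pn in points])
    q_lo, q_hi = (0.3, 0.2), (0.7, 0.95)                  # query window
    c_lo, c_hi = quantize_hex(q_lo), quantize_hex(q_hi)
    params = {}
    for i in range(d):
        params[f"qlb{i}"], params[f"qub{i}"] = q_lo[i], q_hi[i]
        params[f"qclb{i}"] = c_lo[2 * i:2 * i + 2]
        params[f"qcub{i}"] = c_hi[2 * i:2 * i + 2]
    return [row[0] for row in con.execute(build_range_query(d, k), params)]

print(demo())
```

For the sample data, the points with tid 2 and 3 lie inside the query window, so the statement returns exactly these two. Whether the database evaluates the blocks in the intended order depends on the optimizer, which is the subject of the following paragraph.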
The particularity of our approach is that processing of joins between (k + 1) tables is
more efficient than a single scan of the data relation provided that the SQL statement is
transformed into a suitable query evaluation plan (QEP). This can be guaranteed by
hints to the query optimizer. Query processing starts with a table scan of the root table.
The page regions intersecting the query window are selected and the result is projected
to the foreign key attribute child. The value of this result is used in the index join to
efficiently search for the entries in DIRECTORY_1 which are contained in the corresponding
page region. For this purpose, an index range scan is performed. The corresponding
directory entries are retrieved by internal-key-accesses on the corresponding
base table DIRECTORY_1. The qualifying data page numbers are again determined by
selection and projection to the child-attribute. An index range scan similar to the index
scan above is performed on the index of the DATA-table containing the page number
and the quantized version of the data points. Before accessing the exact representation
of the data points, a selection based on the compressed attribute is performed to determine
a suitable candidate set. The last step is the selection based on the exact geometry
of the data points.

3. Experimental Evaluation 

In order to verify our claims that the suggested implementation of multidimensional
index structures does not only provide advantages from a software engineering point of
view but also in terms of performance, we actually implemented the X-tree on top of
Oracle 8 and performed a comprehensive experimental evaluation on both synthetic
and real data. To this end, we compared various query processing techniques for
high-dimensional range queries in relational databases:

1. sequential scan on the data relation,

2. sequential scan on the data relation using the COMPRESSED attributes tech- 
nique 

3. standard index (B-tree) on the first attribute 

4. standard index on all attributes concatenated in a single index 

5. standard indexes on each attribute (inverted-list approach) 

6. X-tree-simulation with and without COMPRESSED attributes technique 



Figure 3: Times to Create an X-tree in Oracle 8. (Plotted against dimension; series: X-tree
(relational implementation), construction of standard B-tree.)

Figure 4: Performance for Range Queries (Synthetic Data) for Varying Dimensions. (N = 100,000;
plotted against dimension; series: sequential scan, standard index, compression, directory,
compression + directory.)



As first experimental results show, the variants 4 and 5 demonstrate a performance 
much worse than all other variants. We will therefore not show detailed results for these 
techniques. 

In our first experiment, we determine the times for creating an X-tree on a large data- 
base. Therefore, we bulk-load the index using different techniques. The results of this 
experiment are shown in figure 3. The relational implementation requires, depending on 
the dimensionality of the data set, between one and five minutes to build an index on a 
100,000 record 16-dimensional database. For this experiment, we use the algorithms 
described in [BBK 98a] for bulk-loading the X-tree, caching intermediate results in an
operating system file. The times for the standard B-tree approach and the X-tree
approach show that a standard B-tree can be built about 2.5 times faster. However, both
techniques yield a good overall performance. 

In the next experiment, we compare the query performance of the different implementations
on synthetic data. The result of an experiment on 100,000 data items of varying
dimensionality is presented in figure 4. The performance of the inverted lists approach and
the standard index on a single attribute is not presented due to bad performance. It can be
seen that both the compressed attributes technique and the X-tree simulation yield high
performance gains over all experiments. Moreover, the combination of both these techniques
outperforms the sequential scan and the standard index for all types of data over all
dimensions. It can also be seen that the combination of the directory and the compressed
attributes technique yields a much better improvement factor than each single technique.
The factor even improves for higher dimensions; the best observed improvement factor in our
experiments was 11.5.

Figure 5: Performance for Range Queries (Synthetic Data) for Varying Database Size (a)
and for Varying Selectivity (b). (Series: sequential scan, standard index, compression,
directory, compression + directory.)

In the experiment depicted in figure 5a, we investigate the performance of the
implementations when varying the size of the database. Again, the relational implementation
of the X-tree with compressed attributes outperforms all other techniques by far. The
acceleration even improves with growing database size.

In the last experiment on real data, we investigate the performance for varying selec- 
tivities. The results of this experiment on 1,000,000 16-dimensional feature vectors are 
shown in figure 5b. The data comes from a similarity search system of a car manufactur- 
er and each feature vector describes the shape of a part. As we can observe from the 
chart, our technique outperforms all other techniques. The effect of the compressed 
attributes, however, was almost negligible. Thus, the performance of the X-tree with and 
without compressed attributes is almost identical. This confirms our claim that
implementing index structures on top of a commercial relational database system shows very
good performance for both synthetic and real data.

4. Conclusions 

In this paper, we proposed a new approach to implement an index structure on top of 
a commercial relational database system. We map the particular index structure to a 
relational database design and simulate the behavior of the index structure using triggers 
and stored procedures. We showed that this can be done easily for a very large class of 
multidimensional index structures. To demonstrate the feasibility and efficiency we im- 
plemented an X-tree on top of Oracle 8. We ran several experiments on large databases 
and recorded a performance improvement of up to a factor of 11.5 compared to a se- 
quential scan of the database. 

In addition to the performance gain, our approach has all the advantages of using a 
fully-fledged database system including recovery, multi-user support and transactions. 
Furthermore, the development times are significantly shorter than in a legacy imple- 
mentation of an index. 
Abstract. Similarity or distance between objects is one of the central
concepts in data mining. In this paper we consider the following problem: 
given a set of event sequences, define a useful notion of similarity between 
the different types of events occurring in the sequences. We approach 
the problem by considering two event types to be similar if they occur 
in similar contexts. The context of an occurrence of an event type is 
defined as the set of types of the events happening within a certain time 
limit before the occurrence. Then two event types are similar if their sets 
of contexts are similar. We quantify this by using a simple approach of 
computing centroids of sets of contexts and using the $L_1$ distance. We
present empirical results on telecommunications alarm sequences and 
student enrollment data, showing that the method produces intuitively 
appealing results. 



1 Introduction 

Most data mining research has concentrated on set-oriented tabular data. There 
are, however, important types of data that do not fit within this framework. One 
such form of data is the event sequence, which occurs in many application areas. An
event sequence is an ordered collection of events from a finite set of event types, 
with each event of the sequence having an occurrence time. See Fig. 1 for an 
example of an event sequence. 

A real-life example of an event sequence is the event or error log from a 
process such as telecommunications network management. Here the event types 
are the possible error messages, and the events are actual occurrences of errors at 
certain times. Also a web access log from a single session of a user can be viewed 
as an event sequence. Now the event types are the web pages, and an individual 
event is a request for a particular page at a particular time. Other examples 
of application areas in which event sequences occur are user interface design 
(event types are different user actions) , criminology (types of crime) , biostatistics 
(different symptoms), etc. In each of these applications, the data consists of one 
or several event sequences. Note that an event sequence is different from a time 
series in that a time series describes a variable with a continuous value over time, 
whereas an event sequence consists of discrete events. 
AEF C DAE CDB EDAF BC EDB CF E

30 35 40 45 50 55 60 time 

Fig. 1. An event sequence on the time axis. 

There has been some research on data mining methods for sequences of 
events: see, e.g., [10, 12-16]. In this paper we consider a problem related to ex- 
ploratory data analysis of event sequences. Suppose we are given a set of event 
sequences from an application domain about which only little domain knowl- 
edge is available. Then one interesting aspect is to gain understanding about the 
similarity between event types: which types of events are similar in some useful 
sense? 

Consider, for example, an application of studying web browsing behavior. 
In such an application we might consider two web pages to be
similar if they provide the users with the same type of information. Finding this
type of similarity in this application is probably quite easy. What is, however,
more interesting is whether the same type of similarity
information could be found from the data alone.

Recently, there has been considerable interest in defining intuitive and easily 
computable measures of similarity between complex objects and in using abstract 
similarity notions in querying databases [1-7, 11, 17, 19]. In this paper we describe 
a method for finding similarities between event types from large sets of event 
sequences. Such similarity information is useful in itself, as it provides insight 
into the data. Moreover, similarities between event types can be used in various 
ways to make querying the data set more useful. 

Our approach to defining similarity between different event types is based on 
the following simple idea: two event types are similar if they occur in similar
contexts. That is, two event types A and B are similar if the situations in which
they occur in the sequences in some way resemble each other. Abstractly, we 
define the similarity between event types A and B by taking the sets of all 
contexts of A and B, and then computing the similarity between these sets of 
contexts. 

To formalize this intuitive idea, we need to answer several questions: (1) 
What is a context of an occurrence of an event type? (2) What does it mean 
that two sets of contexts are similar? An answer to the second question typically 
requires answering a simpler question: (3) Given two contexts, what is their 
distance? There are several ways of answering these questions, and the most 
suitable definition often depends on the application domain. In this paper we 
discuss different alternative answers for these questions and show that even sim- 
ple answers can yield experimental results that are interesting from the practical 
point of view. 

The rest of this paper is organized as follows. In Sect. 2 we define the basic 
concepts: event types, events, and event sequences. Then in Sect. 3 we discuss 
ways of defining the context of an occurrence of an event type, and in Sect. 4 we 
present notions for computing similarity between sets of contexts. Experimental 
results are presented in Sect. 5. Section 6 discusses alternative methods and
presents some conclusions. 

2 Basic Concepts 

A realistic way of modeling events is to consider a set $R = \{A_1, \ldots, A_k\}$ of event
attributes with domains $\mathrm{Dom}(A_1), \ldots, \mathrm{Dom}(A_k)$. An event is a $(k+1)$-tuple
$(a_1, \ldots, a_k, t)$, where $a_i \in \mathrm{Dom}(A_i)$ and $t$ is a real number, the occurrence time
of the event. Then an event sequence is a collection of events over $R \cup \{T\}$, where
the domain of the attribute $T$ is the set of real numbers $\mathbb{R}$. The events in the
event sequence are ordered by ascending occurrence times $t$.

Example 1. In the telecommunications domain events have attributes such as 
alarm type, module, and severity, indicating the type of alarm, the module that 
sent the alarm, and the severity of the alarm, respectively. 

Example 2. In a web log events have attributes like page (the requested page), 
host (the accessing host), and the occurrence time of the request. 

Often it is enough to study a simplified model of events where the only event
properties considered are the type and the occurrence time of the event. Let
$\mathcal{E}$ be the set of event types. Given this set $\mathcal{E}$, an event is a pair $(e, t)$
where $e \in \mathcal{E}$ is the type of the event and $t \in \mathbb{R}$ is the occurrence time of the
event. Then an event sequence $S$ is an ordered collection of events, i.e., $S =
((e_1, t_1), (e_2, t_2), \ldots, (e_n, t_n))$, where $e_i \in \mathcal{E}$, and $t_i \le t_{i+1}$ for all $i = 1, \ldots, n-1$.
The length of the sequence $S$ is, therefore, $|S| = n$. We can also consider even
more simplified sequences that consist only of event types in temporal order.
Such a sequence $S = (e_1, e_2, \ldots, e_n)$, where each $e_i \in \mathcal{E}$, is called an event type
sequence.

Example 3. Consider the event sequence in Fig. 1. A sequence
$S = ((A, 30), (E, 31), (F, 32), \ldots, (E, 64))$
is this same sequence presented as a sequence of (event type, time) pairs.
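
As a minimal sketch of the simplified model, an event sequence can be held as a list of (type, time) pairs kept in ascending time order. The names `Event`, `is_valid_sequence` and `type_sequence` are illustrative and do not come from the paper, and only the first three events of Example 3 are used.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    etype: str   # event type e, an element of the set of event types
    time: float  # occurrence time t


def is_valid_sequence(seq: List[Event]) -> bool:
    """Check that occurrence times are in ascending (non-decreasing) order."""
    return all(a.time <= b.time for a, b in zip(seq, seq[1:]))


def type_sequence(seq: List[Event]) -> List[str]:
    """Drop the times, keeping only the event types in temporal order."""
    return [ev.etype for ev in seq]


# First three events of the sequence in Example 3.
S = [Event("A", 30), Event("E", 31), Event("F", 32)]
print(is_valid_sequence(S), len(S), type_sequence(S))
```

The second helper corresponds to the event type sequence, which keeps only the order of the types.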

3 Contexts of Events 

Our basic idea is that the similarity or distance between two event types is 
determined by the similarity between contexts of occurrences of these event 
types. To make this intuition precise, we have to define what a context is and 
what it means that two sets of contexts are similar. In this section we address 
the first of the questions. 

We start with a simple definition. Consider an event sequence $S =
((e_1, t_1), (e_2, t_2), \ldots, (e_n, t_n))$, and let $i \in \{1, \ldots, n\}$ and $W$ be a time parameter.
Then a context of the event $(e_i, t_i)$ in $S$ is the set of all the types of the
events that occur within $W$ time units before $t_i$. That is,

$$\mathrm{con}((e_i, t_i), S, W) = \{\, e_j \mid (e_j, t_j) \in S \text{ and } t_i - W \le t_j < t_i \,\}.$$
Given an event type $A$, the set of all the contexts of occurrences of $A$ in $S$ is
defined as

$$\mathrm{contexts}(A, S, W) = \{\, \mathrm{con}((e_i, t_i), S, W) \mid (e_i, t_i) \in S \text{ and } e_i = A \,\}.$$

That is, contexts$(A, S, W)$ is a collection of sets con$((e_i, t_i), S, W)$ where $e_i = A$.
If there is no risk of confusion, we use the abbreviation contexts(A) for the set
of contexts of the event type A. The size of the set of contexts is denoted by
|contexts(A)|.

Example 4. Consider the event type set $\mathcal{E} = \{A, B, C, D, E, F\}$ and the event
sequence $S$ in Fig. 1. If we look at what happens within 3 time units before
occurrences of an event type $A$, we get the following set of contexts

$$\mathrm{contexts}(A, S, 3) = \{\emptyset, \{C, D\}, \{D, E\}\}.$$

When looking at the sequence using the same time window of 3 time units, we
can notice that the set of contexts of an event type $B$ is exactly the same as the
set of contexts of the event type $A$, i.e., contexts$(A, S, 3)$ = contexts$(B, S, 3)$.
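
The one-sided context and the set of contexts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the window is taken as the half-open interval from t_i - W (inclusive) up to t_i (exclusive), and the occurrence times below are hypothetical values chosen by hand so that the order of event types follows Fig. 1.

```python
from typing import FrozenSet, List, Set, Tuple

Event = Tuple[str, float]  # (event type, occurrence time)


def con(i: int, seq: List[Event], w: float) -> FrozenSet[str]:
    """Types of the events whose time lies in [t_i - w, t_i)."""
    t_i = seq[i][1]
    return frozenset(e for e, t in seq if t_i - w <= t < t_i)


def contexts(a: str, seq: List[Event], w: float) -> Set[FrozenSet[str]]:
    """Set of contexts of all occurrences of event type `a` in `seq`."""
    return {con(i, seq, w) for i, (e, _) in enumerate(seq) if e == a}


# Hypothetical times; only the order of the event types mimics Fig. 1.
S = [("A", 30), ("E", 31), ("F", 32), ("C", 35), ("D", 37), ("A", 38),
     ("E", 39), ("C", 42), ("D", 43), ("B", 44), ("E", 46), ("D", 47),
     ("A", 48), ("F", 49), ("B", 53), ("C", 54), ("E", 56), ("D", 57),
     ("B", 58), ("C", 61), ("F", 62), ("E", 64)]

ctx_a = contexts("A", S, 3)
ctx_b = contexts("B", S, 3)
print(ctx_a == ctx_b)
```

With these assumed times both event types get the empty context, {C, D} and {D, E}, which matches the outcome described in Example 4.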

The definition of contexts above is by no means the only possible one. First, 
we could use sequences instead of sets in the definition of con$((e_i, t_i), S, W)$,
i.e., we could define the context of an occurrence of an event type to be a se- 
quence, not a set. This approach would be more in line with the idea of analyzing 
sequences; however, it would lead to severe computational problems. 

The second choice made in the above definition of contexts is that we consider 
only a one-sided context. An alternative would be to define con$((e_i, t_i), S, W)$ to
be the set of those event types which have an occurrence at a time between $t_i - W$
and $t_i + W$. This modified definition would be useful, for example, for genome or
protein data, where both directions of the sequence are equally meaningful. Our 
main applications, however, are in sequences where there is a natural direction, 
that of advancing time, and hence, we use one-sided contexts. 

There seldom is a single natural notion of similarity. For example, by vary- 
ing the set of event types which is considered when computing contexts, or by 
changing the time window, we can obtain very different similarity notions. 

4 Similarity between Sets of Contexts 

The previous section showed how we define the set of contexts of an event type. 
What is left is defining the similarity between two sets of contexts. This turns 
out not to be trivial. 

Each context of an occurrence of an event type is a subset of the set of
all event types $\mathcal{E}$. We can, hence, identify each context with an $m$-dimensional
vector of 0's and 1's, supposing that the number of the event types in $\mathcal{E}$ is $m$.
For two such vectors, the similarity or distance between them can be naturally
defined as the Hamming distance, i.e., the number of positions in which the
vectors differ. This corresponds to using a symmetric difference between the sets
of events. Given two event types A and B, their sets of contexts contexts(A)
and contexts(B) are, however, two sets of vectors in an m-dimensional space.
Thus, instead of defining similarity between two context vectors we have to 
define how similar or different two sets of vectors are. 
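
The correspondence between the Hamming distance of two context vectors and the symmetric difference of the underlying sets can be checked with a short Python sketch. The helper names and the fixed ordering of event types are assumptions made for illustration only.

```python
from typing import FrozenSet, List, Sequence


def to_vector(context: FrozenSet[str], event_types: Sequence[str]) -> List[int]:
    """Encode a context as a 0/1 vector, one coordinate per event type."""
    return [1 if e in context else 0 for e in event_types]


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))


EVENT_TYPES = ["A", "B", "C", "D", "E", "F"]  # m = 6
c1 = frozenset({"C", "D"})
c2 = frozenset({"D", "E"})

v1 = to_vector(c1, EVENT_TYPES)  # [0, 0, 1, 1, 0, 0]
v2 = to_vector(c2, EVENT_TYPES)  # [0, 0, 0, 1, 1, 0]
print(hamming(v1, v2), len(c1 ^ c2))  # both values are 2
```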

A statistical approach is to view contexts(A) and contexts(B) as samples
from two distributions $f_A$ and $f_B$ on the $m$-dimensional hypercube and define
the similarity between the event types A and B as the similarity between $f_A$
and $f_B$. This can, in turn, be defined for example by using the Kullback-Leibler
distance [8,9]:

$$d(f_A \,\|\, f_B) = \sum_x f_A(x) \log \frac{f_A(x)}{f_B(x)}$$

or the symmetrized version of it: $d(f_A \,\|\, f_B) + d(f_B \,\|\, f_A)$. Here the summation
variable $x$ varies over all of the $2^m$ points of the hypercube, and hence, direct
application of the formula is not feasible.

Another related alternative is to view the set contexts(A) as a sample from a
multivariate normal distribution and compute the likelihood of obtaining the set
contexts(B) as a sample from that distribution. That is, for an event type $C \in \mathcal{E}$
let $\mu_C^A$ and $\sigma_C^A$ be the mean and the variance of coordinate $C$ in the vectors in
the set contexts(A). Given a vector $(\delta_C)_{C \in \mathcal{E}} \in$ contexts(B), the likelihood of this
vector, given the set contexts(A), is proportional to

$$\prod_{C \in \mathcal{E}} \exp\left(-\frac{(\delta_C - \mu_C^A)^2}{\sigma_C^A}\right) = \exp\left(-\sum_{C \in \mathcal{E}} \frac{(\delta_C - \mu_C^A)^2}{\sigma_C^A}\right).$$

The logarithmic likelihood $g(B|A)$ of the whole set contexts(B) $= \{(\delta_C^i)_{C \in \mathcal{E}} \mid i =
1, \ldots, |\mathrm{contexts}(B)|\}$ is then

$$g(B|A) = -\sum_i \sum_{C \in \mathcal{E}} \frac{(\delta_C^i - \mu_C^A)^2}{\sigma_C^A}.$$

This formula can be used as a distance function. To make it symmetric, we can
use the form $g(B|A) + g(A|B)$.

A problem with this approach is that it can impose a high value of dissimilarity
on the basis of a single event type. If for some $C \in \mathcal{E}$ we have that the set
contexts(A) contains no set with $C$ in it, then $\mu_C^A = 0$ and $\sigma_C^A = 0$. If now at least
one context of $B$ contains $C$, then $g(B|A) = -\infty$, indicating that $B$ is infinitely
far from $A$. In a way this conclusion is justified: no context of A included C, but
at least one context of B did. However, in most cases we would not like to draw
such an extreme conclusion on the basis of difference in one event type.

A way of alleviating this problem would be to use priors on the presence 
of an event type in a context. Intuitively, one way of doing this corresponds to 
adding to each set of contexts an empty context and a context containing all the 
event types in $\mathcal{E}$. Then the variance $\sigma_C^A$ cannot be zero for any $C$.
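
The zero-variance problem and the effect of the two added contexts can be illustrated with a small Python sketch. It assumes the score has the form g(B|A) = -sum over contexts and event types of (delta - mu)^2 / variance, uses the population variance, and treats a zero deviation with zero variance as contributing nothing; these conventions, the function names and the toy vectors are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def mean_var(vectors: Sequence[Sequence[int]], c: int) -> Tuple[float, float]:
    """Mean and population variance of coordinate c over the vectors."""
    vals = [v[c] for v in vectors]
    mu = sum(vals) / len(vals)
    var = sum((x - mu) ** 2 for x in vals) / len(vals)
    return mu, var


def g(ctx_b: Sequence[Sequence[int]], ctx_a: Sequence[Sequence[int]]) -> float:
    """Score g(B|A) = -sum_i sum_C (delta_C^i - mu_C^A)^2 / var_C^A."""
    total = 0.0
    for c in range(len(ctx_a[0])):
        mu, var = mean_var(ctx_a, c)
        for vec in ctx_b:
            diff2 = (vec[c] - mu) ** 2
            if diff2 == 0:
                continue          # no penalty, even when var == 0
            if var == 0:
                return -math.inf  # nonzero deviation with zero variance
            total -= diff2 / var
    return total


def with_priors(ctx: Sequence[Sequence[int]], m: int) -> List[List[int]]:
    """Add an empty context and a context containing all m event types."""
    return [list(v) for v in ctx] + [[0] * m, [1] * m]


ctx_a = [[1, 0, 0], [0, 1, 0]]  # the third event type never occurs
ctx_b = [[1, 0, 1]]             # one context contains the third type

print(g(ctx_b, ctx_a))                  # -inf
print(g(ctx_b, with_priors(ctx_a, 3)))  # finite once priors are added
```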

We use, however, mainly an even simpler alternative. We identify each set
contexts(A) with its centroid vector cev(A), i.e., the vector cev(A) $= (\mu_C^A)_{C \in \mathcal{E}}$,
and define the distance between the sets contexts(A) and contexts(B) as the
$L_1$-distance between the vectors cev(A) and cev(B):

$$d(\mathrm{contexts}(A), \mathrm{contexts}(B)) = |\mathrm{cev}(A) - \mathrm{cev}(B)| = \sum_{C \in \mathcal{E}} |\mu_C^A - \mu_C^B|.$$

It is well-known that this distance measure is a metric.
This approach has the advantage of being robust in the sense that a single
event type cannot have an unbounded effect on the distance. The measure is also 
very fast to compute. A drawback is that the sizes of the sets contexts(A) and 
contexts(B) are not taken into account. In the application domains we consider, 
all the sets of contexts have more or less the same cardinality, so this problem 
is not severe. 
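
A minimal Python sketch of the centroid approach follows. The contexts are passed as a list, so whether repeated contexts are counted once or with multiplicity is left to the caller; the function names and the second set of contexts are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Sequence


def centroid(ctxs: Iterable[FrozenSet[str]],
             event_types: Sequence[str]) -> List[float]:
    """For each event type C, the fraction of contexts that contain C."""
    ctx_list = list(ctxs)
    if not ctx_list:
        raise ValueError("need at least one context")
    n = len(ctx_list)
    return [sum(c in ctx for ctx in ctx_list) / n for c in event_types]


def l1_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


EVENT_TYPES = ["A", "B", "C", "D", "E", "F"]
ctx_a = [frozenset(), frozenset({"C", "D"}), frozenset({"D", "E"})]
ctx_f = [frozenset({"A", "E"}), frozenset({"C"})]  # hypothetical contexts

cev_a = centroid(ctx_a, EVENT_TYPES)
cev_f = centroid(ctx_f, EVENT_TYPES)
print(l1_distance(cev_a, cev_f))
```

Because every centroid coordinate lies between 0 and 1, a single event type can change the distance by at most 1, which is the robustness property noted above.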

5 Experimental Results 

We have evaluated our method using both synthetic and real-life data. For rea- 
sons of brevity, we present in the following only some results on two real-life 
data sets, not on synthetic data. 

Evaluating the success or failure of an automatic way of defining similar- 
ity between event types is not as straightforward as evaluating algorithms for 
prediction. For prediction tasks, there normally is a simple baseline (predicting 
the majority class) and for attribute-based data we can use, for example, C4.5 
as a well-understood reference method. For assigning similarity to event types, 
however, no comparable reference points exist. 

5.1 Course Enrollment Data 

From the course enrollment data of the Department of Computer Science at 
the University of Helsinki we selected 18 frequently occurring courses and used them
as the event types. Each sequence consisted of the enrollments of one student; the
lengths of the sequences varied from 1 to 18 depending on how many of the 18
courses the student had enrolled in. In total, the data set contained enrollments
of 5 519 students. Because the enrollments had no occurrence times,
we took as the context of an occurrence of a course at most 5 courses in which the
student had enrolled before enrolling in the course considered.
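
When no occurrence times are available, a context can be formed by counting back a fixed number of events instead of using a time window. The Python sketch below assumes that the context consists of the (at most) five enrollments immediately preceding the course; the text only states "at most 5 courses", so this reading, the function name and the toy sequences are assumptions.

```python
from typing import FrozenSet, List


def contexts_by_count(course: str, seqs: List[List[str]],
                      k: int = 5) -> List[FrozenSet[str]]:
    """For each enrollment in `course`, the set of up to k courses before it."""
    out = []
    for seq in seqs:
        for i, c in enumerate(seq):
            if c == course:
                out.append(frozenset(seq[max(0, i - k):i]))
    return out


# Hypothetical enrollment sequences; course names are taken from the text,
# the orders are invented for illustration.
students = [
    ["Fundamentals of ADP", "Programming Pascal", "Information Systems"],
    ["Programming Pascal", "Fundamentals of ADP"],
]
result = contexts_by_count("Programming Pascal", students)
print(result)
```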

The intuitive expectation is that two courses in this data set will be similar if 
they are located approximately at the same stage of the curriculum. Each course 
has a recommended term in which the department suggests it should be taken. 
Thus, to investigate how well our similarity measure satisfies the intuition, we 
compared the distances produced by our method with the background distances 
between the courses. The background distance between two courses is defined 
as the difference (as the number of terms) of the ordinal numbers of the recom- 
mended terms. The background distance takes values from 0 to 7. The results 
of this comparison are shown in Fig. 2. There is a clear correlation between the 
background distance and the distance produced by our method. The correlation 
coefficient between these variables is 0.4666. 
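
The comparison can be sketched as follows: compute the background distance for every pair of courses and correlate it with the distance given by the method. All values below are hypothetical toy numbers, not the enrollment data, and the helper `pearson` is a plain implementation of the usual correlation coefficient.

```python
import math
from itertools import combinations
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical recommended terms and method distances for three courses.
recommended_term = {"c1": 1, "c2": 1, "c3": 5}
method_distance = {("c1", "c2"): 0.2, ("c1", "c3"): 1.4, ("c2", "c3"): 1.1}

pairs = list(combinations(sorted(recommended_term), 2))
# Background distance: difference of the recommended terms.
background = [abs(recommended_term[a] - recommended_term[b]) for a, b in pairs]
computed = [method_distance[p] for p in pairs]
r = pearson(background, computed)
print(pairs, background, round(r, 3))
```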

There are some interesting pairs of courses which have a low distance accord- 
ing to our method. For example, the most similar pair of courses is formed by 
the courses Social Role of ADP and Computer Graphics. These are both courses 
that are typically taken during the 3rd or 4th study year, so their sets of contexts 
are rather similar. On the other hand, the distances between the course Social 
Role of ADP and the first year courses are rather high, indicating very different 
sets of contexts, as is natural. 




Similarity between Event Types in Sequences

[Figure 2: Comparison of distances between 18 courses. Horizontal axis: distances based on background information (0 to 7).]

Fig. 2. A plot of distances based on the background information and $L_1$ distances of
centroid vectors computed from the student enrollment data.

Also the courses Fundamentals of ADP and Programming Pascal are found 
to be very similar. This is understandable because they are the first two courses 
recommended to be taken. Therefore, the contexts of their occurrences are very
often empty sets, or contain only the other of the two courses, and thus the sets of
contexts are nearly the same.

On the other hand, there are pairs of courses which are given a high distance 
by our method. For example, the courses Information Systems and Data Structures
Project have the highest distance among all pairs of the 18 courses. These
courses are compulsory for every student and they are usually both taken
somewhere in the middle of the studies. Still, their sets of contexts vary a lot,
indicating that not all the students follow the recommended study plan. 

The student enrollment data set contained only the selected 18 courses. If 
all the other courses provided by the department were taken into account, the 
distances between the courses would give an even more realistic view of the 
relationships between them. Also taking into account the exact terms when the 
students enrolled in the different courses would change the values of the distance
measure. In our experiments the context of an occurrence of a course gives us 
only the information that these courses were taken before this course, not how 
many terms ago this was done. If, however, the exact times were considered, 
the sets of contexts could be very different, at least for those courses which are 
recommended to be taken later during the studies. 

5.2 Telecommunication Alarm Data 

In our telecommunication data set there were 287 different alarm types and a 
total of 73 679 alarms. The data was collected during 50 days. The number 
of occurrences of alarm types differed a lot: from one occurrence to 12 186 
occurrences. The mean number of occurrences of alarm types was 257. 

First we experimented with 23 alarm types that occurred from 100 to 200
times in the whole alarm sequence and computed with our method the distances
between them with a time window of 60 seconds. Figure 3 presents the distribution
of these distances. The distribution resembles the normal distribution. The
high absolute values of the distances indicate, however, that these alarm types
in general are not very similar. The most similar pair of alarm types is
alarm types 2583 and 2733. In the whole alarm sequence they occur 181 and 102
times, respectively. The most dissimilar pair is the pair of alarm types 1564 and
7172, which occur 167 and 101 times in the whole alarm sequence. The number
of occurrences does not seem to explain why the first two alarm types are determined
to be similar and the other two very dissimilar. A more probable explanation
is that the alarm types 2583 and 2733 belong to the same group of alarm types
whereas alarm types 1564 and 7172 describe very different kinds of failures in the
telecommunication network. When looking more thoroughly at the distances, we
noticed that the alarm type 7172 is quite dissimilar to all the other alarm
types; its smallest distance, with a value over 5, is to the alarm type 7010.

Fig. 3. Distribution of distances between 23 alarm types occurring 100 to 200 times in
the whole alarm sequence.

From these 23 alarm types we chose some types for the second experiment.
We considered one of the chosen alarm types at a time. We modified the sequence
so that every occurrence of the chosen alarm type A was independently changed
to an event of the type A' with the probability of 0.5. Then we computed the
contexts for each of the now 24 alarm types and the distances between these sets
of contexts. The assumption was the same as with the synthetic data: the alarm
types A and A' should be the most similar alarm type pair. This was, however,
not the case with any of the chosen alarms. The reason was that the sets of
contexts of the original alarm types were not very homogeneous. Therefore, the
sets of the contexts of alarm types A and A' could not be homogeneous, either.
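
The sequence modification used in this experiment can be sketched in Python as follows. The function name, the fixed random seed and the toy alarm sequence are assumptions for illustration; the renamed type is written with a trailing apostrophe.

```python
import random
from typing import List, Tuple

Event = Tuple[str, float]  # (alarm type, occurrence time)


def split_type(seq: List[Event], a: str, p: float = 0.5,
               seed: int = 0) -> List[Event]:
    """Independently rename each occurrence of `a` to a + "'" with probability p."""
    rng = random.Random(seed)
    out = []
    for e, t in seq:
        if e == a and rng.random() < p:
            e = a + "'"
        out.append((e, t))
    return out


# Hypothetical toy alarm sequence.
alarms = [("2583", 1.0), ("7172", 2.0), ("2583", 3.0), ("2583", 4.0), ("2733", 5.0)]
modified = split_type(alarms, "2583")
print(modified)
```

After this step the contexts and centroid distances are computed for all types, including the new one, and one checks whether the original and the renamed type form the closest pair.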

For the third experiment with the telecommunication alarm data, we chose
eight alarm types that occurred from 10 to 1 000 times in the whole alarm sequence.
By looking at these alarm types, we wanted to find out how the distances
change when the size of the time window W varies. The values of W used were
10, 20, 30, 60, 120, 300, 600, and 1200 seconds. Figure 4 describes how the distances
between alarm type 2263 and seven other chosen alarm types vary with
the size of the time window W. The number of occurrences of alarm type 2263
was about 1000. Already with contexts of window size 60 seconds, the order of
the distances is rather stable: those alarm types that have very dissimilar sets of
contexts have a larger distance than those that have more similar sets of contexts.
The values of the distances also reflect the difference in the number of occurrences
of the chosen alarm types. For example, alarm type 7414 occurs only 10
times in the whole alarm sequence whereas alarm type 2263 occurs about 1000
times. Therefore, it is rather obvious that, especially with longer time windows,
the sets of contexts of these alarm types are very different and the distance between
them becomes very large. Similar observations can also be made when
looking at any of the other chosen alarm types.

Fig. 4. Comparison of distances between alarm type 2263 and seven other alarm types
with different sizes of time window W.

6 Conclusions 

We have described a method for defining similarity between event types on the 
basis of contexts in which the event types occur in sequences. We have evaluated 
the method using both synthetic and real-life data and shown that it works well. 

Several extensions of the method are possible. As mentioned in Sect. 3, in- 
stead of using sets as contexts, we could also use sequences. Then, to define the 
similarity between sets of contexts we could not, however, use the straightfor- 
ward centroid approach of this paper. The distance between two event sequences 
can be computed reasonably fast using edit distance type of approaches (see, e.g.,
[11, 18]), but combining these distances to an overall similarity measure for sets 
of context sequences remains an open problem. 

We mentioned in passing in Sect. 3 that one can vary the set of event types 
that are considered when forming the contexts. This gives possibilities for tailoring
the resulting similarity metric to various needs. Semiautomatic ways of
finding which event types to consider would, however, be needed.

H. Mannila and P. Moen

The empirical behavior of our method on other application domains is also 
an interesting area for future work. For example, sequences arising from web
browsing activity and electronic commerce as well as protein sequences are good 
candidates for successful applications. 
Abstract. Data mining has been widely recognized as a powerful tool to
explore added value from large-scale databases. One data mining technique,
generalized association rule mining with taxonomy, has the potential
to discover more useful knowledge than ordinary flat association mining
by taking application-specific information into account. We propose
SQL queries, named TTR-SQL and TH-SQL, to perform this kind of mining
and evaluate them on a PC cluster. These queries can be more than
30% faster than the Apriori-based SQL query reported previously. Although
an RDBMS has powerful query processing ability through SQL, most data
mining systems use specialized implementations to achieve better performance.
There is a tradeoff between performance and portability. Performance
is not necessarily sufficiently high, but seamless integration with
an existing RDBMS would be considerably advantageous. Since RDBs are already
very popular, the feasibility of generalized association rule mining
can be explored using the proposed SQL queries instead of purchasing
expensive mining software. In addition, parallel RDBs are now also widely
accepted. We show that parallelizing the SQL execution can offer the
same performance as native programs with 10 to 15 nodes. Since
most organizations have many PCs that are not fully utilized, we are
able to exploit such resources to improve the performance significantly.
Keywords: data mining, parallel RDBMS, query optimization, PC cluster



1 Introduction 

Data mining has attracted a lot of attention as a way to solve decision support problems
such as those faced by large retail organizations. Those organizations have accumulated
large amounts of transaction data by means of data collection tools such
as POS and they want to extract value-added information such as unknown buying
patterns from those large databases. One method of data mining to deal with
this kind of problem is association rule mining [1]. This kind of mining, which is also known
as “basket data analysis”, retrieves information like “90% of the customers who
buy A and B also buy C” from transaction data.
Currently, database systems are dominated by relational database systems
(RDBMS). However, most data mining systems employ special mining engines
and do not use the query processing capability of SQL in the RDBMS. Although
integration of a data mining system with an RDBMS provides many benefits, among
others easier system maintenance, flexibility and portability, [5] has reported
that in the case of association rule mining this approach has a drawback in performance.
Association rule mining has to handle very large amounts of transaction
data, which requires very long computation times.

We proposed a large-scale PC cluster as a cost-effective platform for data-intensive
applications such as data mining using a parallel RDBMS, which offers the
advantages of the integration without sacrificing the performance [8].

The SQL approach can be easily enhanced with non-trivial expansion to handle
complex mining tasks. Recently, [6] proposed an SQL query to mine generalized
association rules with taxonomy based on the Apriori algorithm [3], which we will refer to
as “Sarawagi Thomas”-SQL or ST-SQL from now on. In generalized association
rules, application-specific knowledge in the form of taxonomies (is-a hierarchies)
over items is used to discover more interesting rules.

We propose two new queries to mine generalized association rules, TTR-SQL
and TH-SQL, and examine their effectiveness through a real implementation on
the PC cluster. We also compare the performance with a directly coded C program.



2 Mining Generalized Association Rule with Taxonomy 

2.1 Association Rule Mining 

A typical example of an association rule is “if a customer buys A and B then 90%
of such customers also buy C”. Here 90% is called the confidence of the
rule. Another measure of a rule is called the support of the rule.

Transactions in a retail database usually consist of an identifier and a set of
items, or itemset. {A,B,C} in the above example is an itemset. An association rule
is an implication of the form $X \Rightarrow Y$ where $X$ and $Y$ are itemsets. An itemset
$X$ has support $s$ if $s$% of transactions contain that itemset; here we denote
$s = \mathit{support}(X)$. The support of the rule $X \Rightarrow Y$ is $\mathit{support}(X \cup Y)$. The
confidence of that rule can be written as the ratio $\mathit{support}(X \cup Y) / \mathit{support}(X)$.
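
The two measures can be computed directly from their definitions, as in the following Python sketch. The five transactions are hypothetical and the function names are chosen for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Percentage of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(items <= t for t in transactions)
    return 100.0 * hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    x_set = frozenset(x)
    return support(x_set | frozenset(y), transactions) / support(x_set, transactions)


# Hypothetical transactions over the items A, B, C and D.
T = [frozenset({"A", "B", "C"}), frozenset({"A", "B", "C"}),
     frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "D"})]

print(support({"A", "B"}, T))             # 60.0
print(support({"A", "B", "C"}, T))        # 40.0
print(confidence({"A", "B"}, {"C"}, T))   # about 0.667
```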

The problem of mining association rules is to find all the rules that sat- 
isfy a user-specified minimum support and minimum confidence, which can be 
decomposed into two subproblems: 

1. Find all combinations of items, called large itemsets, whose support is greater 
than minimum support. 

2. Use the large itemsets to generate the rules. 

Since the first step consumes most of processing time, development of mining 
algorithms has been concentrated on this step. 
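
The first subproblem can be sketched as a level-wise search in the style of Apriori: candidates of size k are built only from large itemsets of size k-1. This is a simplified illustration and not the code evaluated in the paper; it uses absolute counts, treats an itemset as large when its count is at least `min_count` (the handling of the boundary case is an assumption), and the transactions are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def large_itemsets(transactions: List[FrozenSet[str]],
                   min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least min_count transactions."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    large: Dict[FrozenSet[str], int] = {}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        k += 1
        # Join step: unions of two large itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return large


# Hypothetical transactions.
T = [frozenset({"A", "B", "C"}), frozenset({"A", "B", "C"}),
     frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "D"})]
result = large_itemsets(T, min_count=2)
print(len(result), result[frozenset({"A", "B", "C"})])
```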
2.2 Generalized Association Rule with Taxonomy



In most cases, items can be classified according to some kind of “is a” hierarchy [7].
For example, “Sushi is a Japanese Food” and also “Sushi is a Food”
can be expressed as a taxonomy as shown in figure 1. Here we categorize sushi as a
descendant, and Japanese food and food are its ancestors. This tree can be implemented
as a taxonomy table such as the one shown in the same figure. By including the taxonomy
as application-specific knowledge, more interesting rules can be discovered.

Food
  Japanese
    Sushi
    Sukiyaki
  Chinese
  Italian
    Pizza

DESC            ANC
Sushi           Japanese Food
Sushi           Food
Sukiyaki        Japanese Food
Sukiyaki        Food
Pizza           Italian Food
Pizza           Food
Japanese Food   Food
Chinese Food    Food
Italian Food    Food

Fig. 1. Taxonomy example and its table



Since support counting for each itemset must also include the combinations
of all ancestors of each item, this kind of mining generally requires significantly
more time.
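
The growth in work can be made concrete with a small Python sketch that extends one transaction with all ancestors of its items and counts the resulting item pairs. The parent links are the taxonomy of figure 1 as read here, and the function names and the example transaction are assumptions for illustration.

```python
from itertools import combinations
from typing import Set

# Direct parent of each item in the taxonomy of figure 1 (as read here).
PARENT = {
    "Sushi": "Japanese Food", "Sukiyaki": "Japanese Food",
    "Pizza": "Italian Food",
    "Japanese Food": "Food", "Chinese Food": "Food", "Italian Food": "Food",
}


def ancestors(item: str) -> Set[str]:
    """All ancestors of an item, found by following parent links to the root."""
    out = set()
    while item in PARENT:
        item = PARENT[item]
        out.add(item)
    return out


def extend(transaction: Set[str]) -> Set[str]:
    """The transaction together with all ancestors of its items."""
    ext = set(transaction)
    for item in transaction:
        ext |= ancestors(item)
    return ext


t = {"Sushi", "Pizza"}
ext = extend(t)
pairs_flat = len(list(combinations(t, 2)))
pairs_ext = len(list(combinations(ext, 2)))
print(sorted(ext), pairs_flat, pairs_ext)
```

A transaction with two items yields one item pair, while its extended form with five items yields ten pairs before any pruning of pairs that combine an item with its own ancestor.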



3 Mining Algorithms on SQL 

Most of the algorithms developed to mine association rules were intended to pursue
effectiveness, so they tend to neglect integration with existing systems. Some
exceptions, such as SETM [4], reported an SQL expression of association rule mining.
Recently a pure SQL implementation of the well-known Apriori algorithm has been
reported, but its performance is far behind its object-oriented SQL extensions
or other more loosely integrated approaches [5]. [6] extended the query to mine
generalized association rules with taxonomy. In this paper we name this SQL
query ST-SQL after the names of its authors. In addition, [6] also
extended the query further to handle sequential patterns as well.

In our experiment we employ ordinary standard SQL since it is widely used. 
We propose a new query to mine generalized association rule that we call TTR- 
SQL and compare it with the ST-SQL. We also examine a variant of TTR-SQL 
named TH-SQL. TH-SQL incorporates candidate pruning feature of Apriori. 



3.1 ST-SQL 

The ST-SQL query that we used for comparison is described in figure 3. This
query is based on the Cumulate algorithm proposed in [7]. The Cumulate
algorithm itself is based on the Apriori algorithm for mining boolean (flat)
association rules, extended with optimizations that exploit the
characteristics of generalized association rules, such as pruning itemsets
that contain an item and its ancestors, and pre-computing the ancestors of
each item. However, since we have the TAXONOMY table in the form (descendant,
ancestor), the latter optimization of pre-computing the ancestors is
implicitly incorporated. The transaction data is normalized into the first
normal form (transactionID, item).

In the first pass, we count the support of the items in the extended
transaction data to determine the large itemsets for pass 1 (F_1). Here the
extended transaction data also takes the form (transactionID, item), but it
contains not only the items in the transactions but also all their ancestors.
It is created by a subquery SXTD that employs a union operation. The SELECT
DISTINCT clause in the subquery ensures that there are no duplicate records
caused by extending items with a common ancestor in the same transaction.
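The first pass can be tried out with a small sketch that runs the SXTD union
and the F_1 aggregation in SQLite through Python's sqlite3 module. This is
only an illustration under several assumptions: the sample rows are invented,
items are stored as text rather than integers, the taxonomy columns are
renamed to descendant and ancestor because DESC is a reserved word in SQLite,
and the minimum support is an absolute count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE SALES (id INT, item TEXT)")
cur.execute("CREATE TABLE TAXONOMY (descendant TEXT, ancestor TEXT)")
# Hypothetical toy data: three transactions.
cur.executemany("INSERT INTO SALES VALUES (?, ?)", [
    (1, "Sushi"), (1, "Sukiyaki"), (2, "Sushi"), (3, "Pizza"),
])
cur.executemany("INSERT INTO TAXONOMY VALUES (?, ?)", [
    ("Sushi", "Japanese Food"), ("Sushi", "Food"),
    ("Sukiyaki", "Japanese Food"), ("Sukiyaki", "Food"),
    ("Pizza", "Italian Food"), ("Pizza", "Food"),
    ("Japanese Food", "Food"), ("Italian Food", "Food"),
])

# Extended transaction data: original items plus all of their ancestors.
# Transaction 1 reaches "Japanese Food" through two items but keeps one row.
SXTD = """
    SELECT id, item FROM SALES
    UNION
    SELECT DISTINCT p.id, t.ancestor
    FROM SALES p, TAXONOMY t
    WHERE p.item = t.descendant
"""

MIN_SUPPORT = 2  # absolute count, chosen only for this toy data
f_1 = dict(cur.execute(
    f"SELECT item, COUNT(*) FROM ({SXTD}) GROUP BY item "
    "HAVING COUNT(*) >= ?", (MIN_SUPPORT,)).fetchall())
```

With this data only Sushi, Japanese Food and Food reach the threshold, so
f_1 contains exactly those three items.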

In the second pass we apply the optimization that prunes item pairs containing
both a descendant and its ancestor from the large itemsets of the second pass,
F_2. This is done with the exclusion clause NOT IN, which excludes item pairs
that match (ancestor, descendant) or (descendant, ancestor) in the taxonomy
table. A rule that contains both a descendant and its ancestor is trivially
true and hence redundant. [7] proved that this optimization is needed only at
the second pass, since the candidate itemset pruning of the Apriori algorithm
guarantees that later passes will not contain such itemsets. The query also
employs the second pass optimization described in [5], which uses F_1
directly instead of materializing C_2 first.
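The effect of the NOT IN exclusion can be sketched in a few lines of Python.
The function name candidate_pairs and the item "Beer" are hypothetical; the
sketch only shows which pairs of large items survive the ancestor check, not
the support counting itself.

```python
from itertools import combinations

# A fragment of the taxonomy table as (descendant, ancestor) pairs.
TAXONOMY = {
    ("Sushi", "Japanese Food"), ("Sushi", "Food"),
    ("Japanese Food", "Food"),
}

def candidate_pairs(f_1):
    """Pairs of large items, skipping pairs of an item and its ancestor."""
    pairs = []
    for a, b in combinations(sorted(f_1), 2):  # a < b lexicographically
        if (a, b) in TAXONOMY or (b, a) in TAXONOMY:
            continue  # the rule would be trivially true
        pairs.append((a, b))
    return pairs

pairs = candidate_pairs({"Sushi", "Japanese Food", "Food", "Beer"})
```

Only the pairs that combine "Beer" with one of the food items remain, since
every pair among Sushi, Japanese Food and Food links an item to its ancestor.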

The query for the third and later passes is the same as the query for flat
association rules described in [5]. We use the so-called Subquery method for
support counting since it is reported to have the best execution time. Figure
2 illustrates how it is executed. AGGR is a symbol that denotes the support
counting process, such as GROUP BY and COUNT.

First we generate the candidate itemsets for the k-th pass, C_k, by a cascade
of k-1 joins. The first join generates a superset of the candidate itemsets
by self-joining the large itemsets of the previous pass, F_k-1. We assume that
the items are lexicographically ordered. The subsequent joins prune that
superset by checking the membership in F_k-1 of all its (k-1)-length subsets.
This is done by skipping one item after another from the k-length itemsets in
the superset at each join. [3] suggested that this pruning can drastically
reduce the number of candidate itemsets.
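The join and prune steps described above correspond to the usual Apriori
candidate generation, which can be sketched in Python as follows. The function
name generate_candidates and the sample itemsets are assumptions made for
illustration; itemsets are represented as lexicographically sorted tuples.

```python
from itertools import combinations

def generate_candidates(f_prev, k):
    """Build C_k from F_{k-1}; every itemset is a sorted tuple."""
    large = set(f_prev)
    candidates = []
    for a in f_prev:
        for b in f_prev:
            # Join step: same first k-2 items, last item of a < last of b.
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                c = a + (b[k - 2],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(s in large for s in combinations(c, k - 1)):
                    candidates.append(c)
    return sorted(candidates)

# Hypothetical large 2-itemsets.
f_2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
c_3 = generate_candidates(f_2, 3)
```

Here the join step proposes (A, B, C), (A, B, D) and (A, C, D), and the prune
step discards the last two because (B, D) and (C, D) are not large.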

The support counting involves the second and third stages of figure 2. A
cascade of k subqueries Q_l is required. Subquery Q_l generates all distinct
transactions whose l items match the first l items of C_k. This subquery is
cascaded k times to obtain all k-length itemsets, or k-itemsets, in the
transaction data. Those k-itemsets are counted to determine the large itemsets
of pass k, F_k. Inside each subquery Q_l, subquery SXTD is executed again to
include the l-th item into the result.
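The cascade of subqueries can be modelled roughly in Python as below. This is
a simplified sketch rather than the query itself: extended stands in for the
result of SXTD as a hypothetical mapping from transaction ID to its extended
item set, and each loop iteration plays the role of one subquery Q_l.

```python
from collections import Counter

def count_support(extended, c_k, k):
    """Count candidates of C_k by k successive prefix-matching steps."""
    # q holds (prefix, tid) pairs, i.e. the result of the previous subquery.
    q = {((), tid) for tid in extended}
    for l in range(1, k + 1):
        prefixes = {c[:l] for c in c_k}  # DISTINCT item_1 .. item_l of C_k
        # Each step consults the extended transaction data again.
        q = {(p, tid)
             for p in prefixes
             for (prev, tid) in q
             if prev == p[:l - 1] and p[l - 1] in extended[tid]}
    return Counter(p for p, _ in q)

# Hypothetical extended transactions and a single candidate 3-itemset.
extended = {1: {"A", "B", "C"}, 2: {"A", "B"}, 3: {"A", "B", "C", "D"}}
counts = count_support(extended, [("A", "B", "C")], 3)
```

The sketch makes the cost structure visible: the extended transaction data is
consulted once per step, that is, k times per pass.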



[Execution tree. Stage 1: F_k-1 is joined with F_k-1 (k-1 times) to produce
C_k. Stage 2: subquery Q_l (1 <= l <= k, k times) combines SubQ_l-1 and C_k
with subquery SXTD (a UNION over SALES and TAXONOMY). Stage 3: AGGR over
SubQ_k produces F_k.]

Fig. 2. Execution tree for k-th pass of ST-SQL (k > 3)
CREATE TABLE SALES (id int, item int);
CREATE TABLE TAXONOMY (desc int, anc int);

-- PASS 1
CREATE TABLE F_1 (item_1 int, cnt int);

INSERT INTO F_1
SELECT item AS item_1, COUNT(*)
FROM (SUBQUERY SXTD)
GROUP BY item
HAVING COUNT(*) >= :min_support;

SUBQUERY SXTD
SELECT id, item FROM SALES UNION
SELECT DISTINCT p.id, t.anc
FROM SALES p, TAXONOMY t
WHERE p.item = t.desc;

-- PASS 2
CREATE TABLE F_2 (item_1 int, item_2 int, cnt int);

INSERT INTO F_2
SELECT i1.item_1, i2.item_1, COUNT(*)
FROM F_1 i1, F_1 i2, (SUBQUERY SXTD) t1, (SUBQUERY SXTD) t2
WHERE t1.id = t2.id
AND t1.item = i1.item_1
AND t2.item = i2.item_1
AND i1.item_1 < i2.item_1
AND (i1.item_1, i2.item_1)
    NOT IN (SELECT anc, desc FROM TAXONOMY
            UNION SELECT desc, anc FROM TAXONOMY)
GROUP BY i1.item_1, i2.item_1
HAVING COUNT(*) >= :min_support;

-- PASS k (k > 2)
CREATE TABLE C_k (item_1 int, item_2 int, ..., item_k int);
CREATE TABLE F_k (item_1 int, item_2 int, ..., item_k int, cnt int);

INSERT INTO C_k  -- candidate itemsets
SELECT i1.item_1, i1.item_2, ..., i1.item_k-1, i2.item_k-1
FROM F_k-1 i1, F_k-1 i2, F_k-1 i3, ..., F_k-1 i_k
WHERE i1.item_1 = i2.item_1
AND i1.item_2 = i2.item_2
...
AND i1.item_k-2 = i2.item_k-2
AND i1.item_k-1 < i2.item_k-1
-- pruning by checking memberships in (k-1) subsets
AND i1.item_2 = i3.item_1  -- skip item_1
AND i1.item_3 = i3.item_2
...
AND i1.item_k-1 = i3.item_k-2
AND i2.item_k-1 = i3.item_k-1
...
AND i1.item_1 = i_k.item_1  -- skip item_k-2
...
AND i1.item_k-3 = i_k.item_k-3
AND i1.item_k-1 = i_k.item_k-2
AND i2.item_k-1 = i_k.item_k-1;

INSERT INTO F_k  -- large itemsets
SELECT t.item_1, t.item_2, ..., t.item_k, COUNT(*)
FROM (SUBQUERY Q_k) t
GROUP BY t.item_1, t.item_2, ..., t.item_k
HAVING COUNT(*) >= :min_support;

-- for any l between 1 and k
SUBQUERY Q_l
SELECT d_l.item_1, d_l.item_2, ..., d_l.item_l, t_l.id
FROM (SUBQUERY SXTD) t_l,
     (SUBQUERY Q_l-1) r_l-1,
     (SELECT DISTINCT item_1, ..., item_l FROM C_k) d_l
WHERE r_l-1.item_1 = d_l.item_1
AND r_l-1.item_2 = d_l.item_2
...
AND r_l-1.item_l-1 = d_l.item_l-1
AND r_l-1.id = t_l.id
AND t_l.item = d_l.item_l

Fig. 3. SQL query of ST-SQL


CREATE TABLE SALES (id int, item int);
CREATE TABLE RXTD (id int, item int);
CREATE TABLE TAXONOMY (desc int, anc int);

-- PASS 1
CREATE TABLE F_1 (item_1 int, cnt int);
CREATE TABLE R_1 (id int, item_1 int);
CREATE TABLE TAX_H (desc int, anc int);

INSERT INTO R_1
SELECT id, item AS item_1
FROM SALES UNION
SELECT DISTINCT p.id, t.anc
FROM SALES p, TAXONOMY t
WHERE p.item = t.desc;

INSERT INTO F_1
SELECT item_1, COUNT(*)
FROM R_1
GROUP BY item_1
HAVING COUNT(*) >= :min_support;

INSERT INTO TAX_H  -- taxonomies pruning
SELECT t.desc, t.anc
FROM F_1 c, TAXONOMY t
WHERE t.desc = c.item_1;

-- PASS k
CREATE TABLE RTMP_k (id int, item_1 int, item_2 int, ..., item_k-1 int);
CREATE TABLE R_k (id int, item_1 int, item_2 int, ..., item_k int);
CREATE TABLE F_k (item_1 int, item_2 int, ..., item_k int, cnt int);

-- (k-1)-length temporary transaction data
INSERT INTO RTMP_k
SELECT p.id, p.item_1, p.item_2, ..., p.item_k-1
FROM R_k-1 p, F_k-1 c
WHERE p.item_1 = c.item_1
AND p.item_2 = c.item_2
...

INSERT INTO R_k  -- k-length itemsets
SELECT p.id, p.item_1, p.item_2, ..., p.item_k-1, q.item_k-1
FROM RTMP_k p, RTMP_k q
WHERE p.id = q.id
AND p.item_1 = q.item_1
AND p.item_2 = q.item_2
...

INSERT INTO F_k  -- large itemsets
SELECT item_1, item_2, ..., item_k, COUNT(*)
FROM R_k
GROUP BY item_1, item_2, ..., item_k
HAVING COUNT(*) >= :min_support;

DROP TABLE RTMP_k;
DROP TABLE R_k-1;

Fig. 4. SQL query of TTR-SQL


INSERT INTO R_k
SELECT p.id, p.item_1, p.item_2, ..., p.item_k-1, q.item_k-1
FROM RTMP_k p, RTMP_k q, C_k c
WHERE p.id = q.id
AND p.item_1 = q.item_1
...
AND p.ite...
AND p.ite...
...
AND p.item_k-1 = c.item_k-1
AND q.item_k-1 = c.item_k

Fig. 5. SQL query modification of TH-SQL

3.2 TTR-SQL

The TTR-SQL query is described in figure 4. We named this query TTR-SQL, from
"mining Taxonomy using Temporary Relations", since it utilizes temporary
relations to preserve transaction data between passes in a similar fashion to
the SETM algorithm [4]. However, SETM is intended for flat association rule
mining, and it is inefficient for generalized association rule mining because
we have to add all ancestors of each item included in a transaction. We apply
the following optimizations to this query:



1. Prune taxonomies whose descendant is not included in the large itemsets of
   the first pass.

   This is obvious since we do not need them in later processing. For example,
   with the taxonomy in figure 1, suppose that the item "Pizza" has too little
   support, so that it is excluded from the large itemsets of the first pass
   in F_1. Then we can remove {Pizza, Italian Food}, {Pizza, Food} and
   {Italian Food, Food} from the taxonomy table; the last entry can go because
   "Italian Food" has no other descendant in figure 1 and so cannot be large
   either. Some of our experiments show that the size of the taxonomy table
   can be reduced by a factor of up to 100. Mining with many passes benefits
   most from this optimization.

2. Prune candidate itemsets containing an item and its ancestor.

   As mentioned before, a rule such as "Sushi => Japanese Food", which
   contains both a descendant and its ancestor, is trivially true and we can
   neglect it. TTR-SQL implements this optimization in every pass except the
   first pass.
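The first optimization amounts to a simple filter on the taxonomy table, as
the following Python sketch suggests. The function name prune_taxonomy and the
assumed content of F_1 are illustrative; the taxonomy is the one of figure 1.

```python
# Taxonomy table of figure 1 as (descendant, ancestor) pairs.
TAXONOMY = [
    ("Sushi", "Japanese Food"), ("Sushi", "Food"),
    ("Sukiyaki", "Japanese Food"), ("Sukiyaki", "Food"),
    ("Pizza", "Italian Food"), ("Pizza", "Food"),
    ("Japanese Food", "Food"), ("Chinese Food", "Food"),
    ("Italian Food", "Food"),
]

def prune_taxonomy(taxonomy, f_1):
    """Keep only the entries whose descendant is a large item (TAX_H)."""
    return [(d, a) for d, a in taxonomy if d in f_1]

# Assumed result of the first pass: Pizza, Italian Food and Chinese Food
# did not reach the minimum support.
f_1 = {"Sushi", "Sukiyaki", "Japanese Food", "Food"}
tax_h = prune_taxonomy(TAXONOMY, f_1)
```

In this toy case the table shrinks from nine entries to five; the reduction
reported in the text comes from much larger taxonomies.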



In the first pass we include all ancestors of each item from the taxonomy
table entries that match the descendant. We use the taxonomy table again to
prune candidate itemsets later, but since most of the items in the transaction
data do not meet the minimum support, we can eliminate them from the taxonomy
table as explained in the first optimization. Thus we use the subset TAX_H of
the taxonomy table in the pruning of the second optimization. Table TAX_H
consists only of entries whose descendants are included in the large itemsets
of the first pass. The first pass of TTR-SQL differs from that of ST-SQL in
that we generate the extended transaction data as a table named R_1. We can
replace the generation of R_1 with a subquery to avoid the materialization
cost.

In the other passes we employ the second optimization while generating
k-itemsets into R_k. The execution tree is shown in figure 7. First, we
include the (k-1)-itemsets that match large itemsets from the previous pass,
F_k-1, into the (k-1)-length temporary transaction data RTMP_k along with
their transaction IDs. Then RTMP_k is used to generate lexicographically
ordered k-itemsets by a self-join. During this generation process, we exclude
k-itemsets that contain both a descendant and its ancestor, using the taxonomy
table subset TAX_H. This second optimization also uses the exclusion clause
NOT IN, in the same way as the second pass of ST-SQL. We only need to check
the (k-1)-th item pairs: since the items in the itemsets are lexicographically
ordered, the preceding items were already checked in the previous passes. The
generated k-itemsets are stored in R_k, which also contains transaction IDs.
Lastly, the large itemsets in F_k are determined from those k-itemsets whose
support is larger than the minimum support.

The temporary transaction data RTMP_k need not always be materialized. We
could avoid the materialization cost of RTMP_k by replacing RTMP_k with a
subquery and including that subquery in the R_k generation query.
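One pass of this scheme can be sketched in Python as follows. This is a
simplified model under stated assumptions, not the query of figure 4: rows of
R_k-1 are (transaction ID, itemset) pairs with itemsets as sorted tuples, the
function name ttr_pass and the sample data are invented, and the minimum
support is an absolute count.

```python
from collections import Counter

def ttr_pass(r_prev, f_prev, tax_h, min_support):
    """Return (R_k, F_k) computed from R_{k-1}, F_{k-1} and TAX_H."""
    # RTMP_k: rows whose itemset was large in the previous pass.
    rtmp = [(tid, s) for tid, s in r_prev if s in f_prev]
    # Item pairs that link a descendant with its ancestor, in either order.
    banned = set(tax_h) | {(a, d) for d, a in tax_h}
    r_k = []
    for tid_p, p in rtmp:
        for tid_q, q in rtmp:
            # Self-join on transaction ID and on all but the last item.
            if tid_p != tid_q or p[:-1] != q[:-1] or not p[-1] < q[-1]:
                continue
            if (p[-1], q[-1]) in banned:  # only the last item pair is checked
                continue
            r_k.append((tid_p, p + (q[-1],)))
    counts = Counter(s for _, s in r_k)
    f_k = {s for s, n in counts.items() if n >= min_support}
    return r_k, f_k

# Hypothetical second pass over two transactions.
r_1 = [(1, ("Beer",)), (1, ("Japanese Food",)), (1, ("Sushi",)),
       (2, ("Beer",)), (2, ("Japanese Food",)), (2, ("Sushi",)),
       (2, ("Pizza",))]
f_1 = {("Beer",), ("Japanese Food",), ("Sushi",)}
tax_h = [("Sushi", "Japanese Food")]
r_2, f_2 = ttr_pass(r_1, f_1, tax_h, min_support=2)
```

The pair (Japanese Food, Sushi) never enters R_2 because it links an item
with its ancestor, and Pizza is dropped already when RTMP_2 is built.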

Fig. 6. Execution tree for second pass of ST-SQL

Fig. 7. Execution tree for k-th pass of TTR-SQL (k > 2)



3.3 TH-SQL 



We can incorporate the candidate pruning feature of the Apriori algorithm into
TTR-SQL to avoid pruning with the taxonomy table in every pass, which could be
expensive. By doing this, we no longer need the second optimization of TTR-SQL
in the first pass, since one pruning with the taxonomy table in the second
pass alone is sufficient. Since this query combines the best features of
TTR-SQL and ST-SQL, we call it TH-SQL, which stands for "Taxonomy Hybrid".

The modification that TH-SQL makes to the TTR-SQL query for pass k > 3 is
depicted in figure 5, where we replace the exclusion clause with a selection
join with the candidate itemsets C_k. C_k is generated at the beginning of
each pass using the same query as ST-SQL. However, the second pass remains the
same as in TTR-SQL, except that we use TAXONOMY instead of TAX_H for the
ancestor pruning. We show the execution tree at pass k > 3 of TH-SQL in figure
8.
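The selection join with C_k can be pictured with a short Python sketch. The
function name th_join and the sample rows are hypothetical; rows of RTMP_k
are (transaction ID, itemset) pairs, and c_k is a set of candidate k-itemsets
stored as sorted tuples.

```python
def th_join(rtmp, c_k):
    """Self-join RTMP_k and keep only itemsets that are candidates in C_k."""
    r_k = []
    for tid_p, p in rtmp:
        for tid_q, q in rtmp:
            # Same transaction and same items except the last one.
            if tid_p == tid_q and p[:-1] == q[:-1]:
                itemset = p + (q[-1],)
                if itemset in c_k:  # replaces the NOT IN taxonomy check
                    r_k.append((tid_p, itemset))
    return r_k

# Hypothetical temporary transaction data for one transaction.
rtmp = [(1, ("A", "B")), (1, ("A", "C")), (1, ("A", "D"))]
r_3 = th_join(rtmp, {("A", "B", "C")})
```

Since the candidates are sorted and contain no repeated items, membership in
C_k also enforces the lexicographic order of the generated itemsets.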



Fig. 8. Execution tree for k-th pass of TH-SQL (k > 3)

Fig. 9. Cardinality of RTMP_k (1% min. support)

Fig. 10. Effect of candidate pruning (1% min. support)
3.4 Comparison of Three Approaches

ST-SQL completes its first and second passes faster than the two other
queries. [3] has reported that SETM produces a very large R_2 compared with
Apriori, which leads to performance degradation. However, we found that the
executions of these passes basically use the same execution tree for all
queries. The most efficient execution tree of the second pass of ST-SQL is
shown in figure 6. It allows early filtering of transactions based on the
large itemsets of the first pass, F_1, instead of the larger C_2. The only
difference is that ST-SQL does not materialize the intermediate results of the
joins. Hence the time difference is limited to the materialization of R_1 and
the generation of RTMP_2. In addition, TTR-SQL also requires the generation of
TAX_H. Our performance evaluation shows that those cost differences are
relatively small.

The time required by ST-SQL is dominated by support counting (stages 2 and
3). We can derive from figure 2 that at pass k larger than three, ST-SQL uses
the result of subquery SXTD each time it executes subquery Q_l. This means it
has to read the entire transaction data from disk k times in each pass. Thus
the execution time of ST-SQL at pass k > 3 is proportional to k multiplied by
the size of the extended transaction data, in addition to the cost of the
joins in subquery Q_l. Even if the cost of the joins decreases as the number
of candidate itemsets becomes smaller for large k, the cost of subquery SXTD
makes ST-SQL inefficient for data mining with many passes.

On the other hand, the dominant factors in the execution of TTR-SQL and TH-SQL
are the generation of RTMP_k and R_k. However, the size of RTMP_k for both
TTR-SQL and TH-SQL generally shrinks as k increases. This general behaviour is
shown in figure 9. The figure also shows that the extended transaction data
can be an order of magnitude larger than the original transaction data. R_k
becomes smaller as well. We can expect that our proposed queries will perform
better at passes beyond the second for most datasets.

The R_k relation of TH-SQL is smaller than that of TTR-SQL because of the
candidate pruning performed beforehand. Figure 10 shows how this pruning
reduces the number of candidate itemsets. This size reduction affects the time
required for the support counting and RTMP_k generation stages.

Thus we expect that execution time of TH-SQL < execution time of TTR-SQL <
execution time of ST-SQL, which we examine in the next section with a real
implementation.



4 Performance Evaluation 

4.1 Parallel Execution Environment 

The experiment is conducted on a PC cluster developed at the Institute of
Industrial Science, The University of Tokyo. This pilot system, named
NEDO-100, consists of one hundred commodity PCs connected by an ATM network.
We have also developed the DBKernel database server for query processing on
this system. Each PC has an Intel Pentium Pro 200MHz CPU, a 4.3GB SCSI hard
disk and 64 MB RAM.

A performance evaluation using the TPC-D benchmark on the 100-node cluster is
reported in [8]. The results showed that it can achieve significantly higher
performance than currently available commercial high-end systems, especially
for join-intensive queries such as query 9.



4.2 Dataset 

The synthetic transaction data generator developed at IBM Almaden [3] is used
for this experiment with the parameters described in table 1. The transaction
data is distributed uniformly by transaction ID among the local hard disks of
the processing nodes, while the taxonomy table is replicated to each node.



4.3 Performance Evaluation of ST-SQL, TTR-SQL, and TH-SQL 

In this section, we compare the three SQL queries described in the previous
section. The performance evaluation is done on five nodes. The experiments
with a varied number of nodes will be given in the next subsection.

Figure 11 shows typical execution times of the three queries. The minimum
support is set to 2.5%, and the mining takes four passes. We can see that
although ST-SQL is superior to the other queries up to the second pass, it
spends too much time in the third and later passes. Overall, TTR-SQL and
TH-SQL are 30% faster than ST-SQL for this minimum support. Since the time
required by ST-SQL is proportional to the number of required passes k, ST-SQL
will suffer when k is larger, such as when the minimum support is smaller.
This is a major drawback, since we usually want a smaller minimum support with
higher confidence to generate more interesting rules.

Contrary to our expectation, however, even though the number of tuples in R_k
of TH-SQL is reduced by up to 50% compared with that of TTR-SQL, as shown in
figure 12, we do not see any remarkable performance improvement of TH-SQL
over TTR-SQL. Since the dataset itself is small, the gain achieved might be
relatively small. Thus we are planning to perform larger-scale experiments to
examine it.

Candidate generation of SETM at the second pass generally produces an
extremely large R_2. However, the current pruning method is useless at this
pass. An effective pruning of R_2 would considerably improve the performance
of SETM as well as that of TTR-SQL and TH-SQL; this is also left for further
investigation.

The speedup ratio of our algorithms, that is, the gain achieved by
parallelization, is shown in figure 13 and indicates that they can be
parallelized well. The execution is 9 times faster with 10 nodes. This result
supports the feasibility of data mining with a parallel RDB engine. We report
a comparison with a specialized data mining program in the next subsection.
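For reference, the speedup ratio and the corresponding parallel efficiency
can be computed as in the following small Python sketch. The timings are
hypothetical values chosen to reproduce the reported ratio of 9 on 10 nodes;
they are not measurements from the experiment.

```python
def speedup(t_one_node, t_n_nodes):
    """Speedup ratio: single-node time divided by n-node time."""
    return t_one_node / t_n_nodes

def efficiency(t_one_node, t_n_nodes, n):
    """Fraction of the ideal n-fold speedup that is actually achieved."""
    return speedup(t_one_node, t_n_nodes) / n

# Hypothetical timings in seconds (not taken from the paper).
s = speedup(900.0, 100.0)
e = efficiency(900.0, 100.0, 10)
```

A speedup of 9 on 10 nodes corresponds to an efficiency of 0.9, that is, about
ten percent of the ideal gain is lost to parallelization overhead.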

However, the smaller size of the data at each node degrades the speedup ratio
when we increase the number of nodes beyond 20. At this point the
parallelization overhead, such as the communication cost, offsets the gain.
Table 1. Dataset parameters

Size of transaction data      12.8MB
Number of transactions        200000
Average transaction length    5
Size of taxonomy              182KB
Number of items               20000
Number of roots               50
Number of levels              3
Average fanout                20

Fig. 11. Execution time and pass contribution (1 node)

Fig. 12. Cardinality of R_k

Fig. 13. Speedup ratio

Fig. 14. Execution trace for ST-SQL (5 nodes, 2.5% min. support)

Fig. 15. Execution trace for TTR-SQL (5 nodes, 2.5% min. support)

Fig. 16. Execution trace for TH-SQL (5 nodes, 2.5% minimum support)

Fig. 17. Execution time comparison with directly coded program

We have recorded the execution traces of ST-SQL, TTR-SQL and TH-SQL, as shown
in figures 14 to 16 respectively. Note that most of the time the query
execution is a CPU-bound process. Significant network activity is observed
only during the aggregation process, when nodes exchange data to obtain the
overall support count.

When we look further into the execution trace of the ST-SQL query, during the
second stage of the third pass we recognize three bursts of disk reads that
indicate the execution of subquery SXTD. Between those bursts we see drops in
the free memory, which indicate that the intermediate result of each subquery
Q_l consumes a considerable amount of memory. The fourth pass reveals four
similar patterns of disk activity that dominate the execution time of this
pass. The trace supports our analysis that this kind of disk read dominates
passes with small candidate itemsets C_k; thus ST-SQL is not suitable for
mining with many passes.

The execution trace of TTR-SQL in figure 15 shows a similar pattern repeated
in every pass after the second. It also reveals that the CPU becomes idle
when the relation is materialized to disk. However, in contrast to ST-SQL, it
utilizes memory better, which allows TTR-SQL to handle larger transaction
data. We also observe that the introduction of candidate itemset pruning in
TH-SQL does not affect the execution trace in figure 16 much, because TAX_H,
F_k and C_k are very small.

4.4 Performance comparison of parallelized TTR-SQL with sequential C-code

In this section, we compare the performance of a directly coded C program
with the parallelized SQL. It is true that C code is much faster than SQL,
but by employing SQL we are able to integrate the mining mechanism into the
database system seamlessly. Nowadays many PCs are used in organizations, and
not all of them are fully utilized. In addition, most current relational
middleware products can be extended for parallel execution. We could exploit
this potential for parallelization by utilizing such abundant resources. Here
we use the SQL engine that we developed on the PC cluster [8]. The
performance evaluation results are shown in figure 17. The results for two
different minimum supports are given. On average, we can achieve the same
level of execution time by employing 10 to 15 nodes. Due to the space limit,
we are not able to show the detailed trace information of the parallel
execution, but there remains some room for optimization to reduce
synchronization. We have shown that we can achieve reasonable performance by
activating around 10 nodes. This might look expensive, but the recent
reduction in hardware prices suggests that an SQL implementation on a PC
cluster would be one of the alternatives.

5 Summary and Conclusion 

The ability to perform data mining using standard SQL queries will benefit
data warehouses through better integration with the RDBMS. It also allows
easier porting of code among different systems, a task that requires a lot of
effort with specialized "black box" programs.

Generalized association rule mining with taxonomy is one of the complicated
mining tasks that used to depend on specialized programs. As far as the
authors know, only one pure SQL-92 query is available to perform generalized
association rule mining, namely ST-SQL, which is proposed in [6].

We presented two new SQL queries: TTR-SQL, which utilizes temporary
relations, and TH-SQL, which combines the best features of the two queries
mentioned earlier, such as candidate pruning. We have evaluated the three
queries and showed that our proposed queries can achieve up to 30% better
performance for data mining with four passes. We can expect more improvement
with more passes. Our analysis indicated that the execution time of ST-SQL at
pass k is at least proportional to k times the size of the transaction data.
This results in poor performance for passes larger than three.

The PC cluster is a promising platform for parallel RDBMS because of its high
cost-performance. We have also made a performance comparison with a data
mining program written natively in C. We found that 10 to 15 nodes are enough
to match the specialized program.
Abstract. One of the most challenging problems in data manipulation in the
future is to be able to efficiently handle not only very large databases but
also multiple induced properties or generalizations in that data. Popular
examples of useful properties are association rules, and inclusion and
functional dependencies. Our view of a possible approach for this task is to
specify and query inductive databases, which are databases that in addition
to data also contain intensionally defined generalizations about the data. We
formalize this concept and show how it can be used throughout the whole
process of data mining due to the closure property of the framework. We show
that simple query languages can be defined using normal database terminology.
We demonstrate the use of this framework to model typical data mining
processes. It is then possible to perform various tasks on these
descriptions, e.g., optimizing the selection of interesting properties or
comparing two processes.



1 Introduction 

Data mining, or knowledge discovery in databases (KDD), sets new challenges to 
database technology: new concepts and methods are needed for general purpose 
query languages [8]. A possible approach is to formulate a data mining task as 
locating interesting sentences from a given logic that are true in the database. 
Then the task of the user/analyst can be viewed as querying this set, the so-called 
theory of the database [12]. 

Discovering knowledge from data, the so-called KDD process, contains several 
steps: understanding the domain, preparing the data set, discovering patterns, 
postprocessing of discovered patterns, and putting the results into use. This is a 
complex interactive and iterative process for which many related theories have to 
be computed: different selection predicates but also different classes of patterns 
must be used. 

For KDD, we need a query language that enables the user to select subsets of
the data, but also to specify data mining tasks and select patterns from the
corresponding theories. Our special interest is in the combined pattern
discovery and postprocessing steps via a querying approach. For this purpose,
a closure property of the query language is desirable: the result of a KDD
query should be an object of a type similar to that of its arguments.
Furthermore, the user must also be able to cross the boundary between data
and patterns, e.g., when exceptions to a pattern are to be analysed. This
gives rise to the concept of inductive databases, i.e., databases that
contain inductive generalizations about the data, in addition to the usual
data. The KDD process can then be described as a sequence of queries on an
inductive database. The inductive database concept has been suggested in [8,
11]. In this paper, we use the simple formalization we introduced in [4].
However, the topic is different. In [4], we considered the MINE RULE operator
as a possible querying language on association rule inductive databases. Here
we emphasize the genericity of the framework and its use for KDD process
modeling. It leads us to propose a research agenda to design general purpose
query languages for KDD applications. Our basic message is very simple: (1)
An inductive database consists of a normal database associated with a subset
of patterns from a class of patterns, and an evaluation function that tells
how the patterns occur in the data. (2) An inductive database can be queried
(in principle) just by using normal relational algebra or SQL, with the added
property of being able to refer to the values of the evaluation function on
the patterns. (3) Modeling KDD processes as a sequence of queries on an
inductive database gives rise to opportunities for reasoning about and
optimizing these processes.

The paper is organized as follows. In Section 2 we define the inductive
database framework and introduce KDD queries by means of examples. Section 3
considers the description of KDD processes and the added value of the
framework for their understanding and their optimization. Section 4 is a
short conclusion with open problems concerning the research in progress.

2 Inductive Databases 

The schema of an inductive database is a pair
$\mathcal{R} = (R, (Q_R, e, V))$, where $R$ is a database schema, $Q_R$ is a
collection of patterns, $V$ is a set of result values, and $e$ is the
evaluation function that defines pattern semantics. This function maps each
pair $(r, \theta_i)$ to an element of $V$, where $r$ is a database over $R$
and $\theta_i \in Q_R$ is a pattern. An instance of the schema, an inductive
database $(r, s)$ over the schema $\mathcal{R}$, consists of a database $r$
over the schema $R$ and a subset $s \subseteq Q_R$.

Example 1 If the patterns are boolean formulae about the database, $V$ is
{true, false}, and the evaluation function $e(r, \theta)$ has value true iff
the formula $\theta$ is true about $r$. In practice, a user might select the
true or the false formulas from the intensionally defined collection of all
boolean formulas. □

At each stage of manipulating the inductive database $(r, s)$, the user can
think that the value of $e(r, \theta)$ is available for each pattern $\theta$
which is present in the set $s$. Obviously, if the pattern class is large (as
is the case for boolean formulas), an implementation cannot compute all the
values of the evaluation function beforehand; rather, only those values
$e(r, \theta)$ that the user's queries require should be computed.
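The idea of computing the evaluation function only on demand can be sketched
in Python as follows. The class name InductiveDB, the particular pattern
class (one formula per column, stating that the column is 1 in every row) and
the sample data are assumptions made for illustration only.

```python
class InductiveDB:
    """An instance (r, s) whose evaluation function e is applied lazily."""

    def __init__(self, r, s, e):
        self.r = r              # the database
        self.s = frozenset(s)   # the selected patterns
        self.e = e              # evaluation function e(r, theta)
        self._cache = {}        # values computed so far

    def value(self, theta):
        if theta not in self.s:
            raise KeyError(theta)
        if theta not in self._cache:  # compute e(r, theta) only when asked
            self._cache[theta] = self.e(self.r, theta)
        return self._cache[theta]

# Hypothetical boolean patterns in the spirit of Example 1: the pattern named
# by a column holds iff that column is 1 in every row of r.
def e(r, theta):
    return all(row[theta] == 1 for row in r)

idb = InductiveDB(r=[{"A": 1, "B": 1}, {"A": 1, "B": 0}], s={"A", "B"}, e=e)
value_a = idb.value("A")
```

After this call only the value for pattern "A" has been computed; the value
for "B" is evaluated when a query first refers to it.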
A typical KDD process operates on both of the components of an inductive
database. The user can select a subset of the rows or, more generally, select
data from the database or the data warehouse. In that case, the pattern
component remains the same. The user can also select subsets of the patterns,
and in the answer the data component is the same as before.

The situation can be compared with deductive databases, where some form of
deduction is used to augment fact databases with a potentially infinite set
of derived facts. However, within the inductive database framework, the
intensional facts denote generalizations that have to be learned from the
data. So far, the discovery of the patterns we are interested in cannot be
described using available deductive database mechanisms.

Using the above definition of inductive databases, it is easy to formulate
query languages for them. For example, we can write relational algebra
queries where, in addition to the normal operations, we can also refer to the
patterns and the value of the evaluation function on the patterns. To refer
to the values of $e(r, \theta)$ for any $\theta \in s$, we can think in terms
of object-oriented databases: the evaluation function $e$ is a method that
encodes the semantics of the patterns.

In the following, we first illustrate the framework on association rules
(Section 2.1), and then we generalize the approach and point out key issues
for query evaluation in general (Section 2.2).

2.1 Association Rules 

The association rule mining problem has received much attention since its
introduction in [1]. Given a schema $R = \{A_1, \ldots, A_n\}$ of attributes
with domain $\{0, 1\}$, and a relation $r$ over $R$, an association rule
about $r$ is an expression of the form $X \Rightarrow B$, where
$X \subseteq R$ and $B \in R \setminus X$. The intuitive meaning of the rule
is that if a row of the matrix $r$ has a 1 in each column of $X$, then the
row tends to have a 1 also in column $B$. This semantics is captured by
frequency and confidence values. Given $W \subseteq R$,
$\mathrm{support}(W, r)$ denotes the fraction of rows of $r$ that have a 1 in
each column of $W$. The frequency of $X \Rightarrow B$ in $r$ is defined to
be $\mathrm{support}(X \cup \{B\}, r)$, while its confidence is
$\mathrm{support}(X \cup \{B\}, r) / \mathrm{support}(X, r)$. Typically, we
are interested in association rules for which the frequency and the
confidence are greater than given thresholds. Though the search space is
exponential, association rules can be computed thanks to these thresholds on
the one hand and a safe pruning criterion that drastically reduces the search
space on the other hand (the so-called apriori trick [2]).
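These definitions translate directly into code. The following Python sketch
computes support, frequency and confidence on a small relation; the relation
shown is a hypothetical example and not the instance of Table 1, and the
function names are chosen for this illustration.

```python
def support(w, r):
    """Fraction of rows of r that have a 1 in every column of w."""
    return sum(1 for row in r if all(row[a] == 1 for a in w)) / len(r)

def frequency(x, b, r):
    """Frequency of the rule X => B in r."""
    return support(set(x) | {b}, r)

def confidence(x, b, r):
    """Confidence of X => B in r (assumes support(X, r) is non-zero)."""
    return support(set(x) | {b}, r) / support(x, r)

# Hypothetical relation over R = {A, B, C}.
r = [
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
]
f_ab = frequency({"A"}, "B", r)
c_ab = confidence({"A"}, "B", r)
```

For the rule A => B, two of the four rows contain both A and B, and three
rows contain A, so the frequency is 0.5 and the confidence is 2/3.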

However, the corresponding inductive database schema defines intensionally
all the potential association rules. In this case, $V$ is the set
$[0, 1]^2$, and $e(r, \theta) = (f(r, \theta), c(r, \theta))$, where
$f(r, \theta)$ and $c(r, \theta)$ are the frequency and the confidence of the
rule $\theta$ in the database $r$. Notice that many other objective
interestingness measures have been introduced for this kind of pattern (e.g.,
the J-measure [15] or the conviction [5]). All these measures could be taken
into account by a new evaluation function.
We now describe the querying approach by using self-explanatory notations
for the simple extension of the relational algebra that fits our need.

[Table 1: the cell values of the three instances are not recoverable from the source.]

Table 1. Patterns in three instances of an inductive database.


Example 2 Mining association rules is now considered as querying inductive
database instances of schema $(R, (Q_R, e, [0, 1]^2))$. Let us consider that the data set is
the instance $r_0$ in Table 1 of the relational schema $R = \{A, B, C\}$.

The inductive database $idb = (r_0, s_0)$ associates to $r_0$ the association rules
on the leftmost table of Table 1. Indeed, in such an example, the intensionally
defined collection of all the association rules can be presented. We illustrate (1)
the selection on tuples, and (2) the selection on patterns in the typical situation
where the user defines some thresholds for frequency and confidence.

1. $\sigma_{A \neq 0}(idb) = (r_1, s_1)$ where $r_1 = \sigma_{A \neq 0}(r_0)$ and $s_1$ contains the association
rules in the middle table of Table 1.

2. $\tau_{e(r_0).f > 0.5 \wedge e(r_0).c > 0.7}(idb) = (r_2, s_2)$ where $r_2 = r_0$ and $s_2$ contains the
association rules from the rightmost table (on the top) of Table 1.

To simplify the presentation, we have denoted by $e(r).f$ and $e(r).c$ the values for
frequency and confidence. □

An important feature is that operations can be composed due to the closure 
property. 

Example 3 Consider that the two operations given in Example 2 are composed
and applied to the instance $idb = (r_0, s_0)$. Now, $\tau_{e(r_0).f > 0.5 \wedge e(r_0).c > 0.7}(\sigma_{A \neq 0}(idb))
= (r_3, s_3)$ where $r_3 = \sigma_{A \neq 0}(r_0)$ and $s_3$ is reduced to the association rule $C \Rightarrow A$
with frequency 0.66 and confidence 1. □

The selection of association rules given in that example is rather classical. Of 
course, a language to express selection criteria has to be defined. It is out of the 
scope of this paper to provide such a definition. However, let us just emphasize 
that less conventional association rule mining can also be easily specified. 

Example 4 Consider an instance $idb = (r_0, s_0)$. It can be interesting to look
for rules that have a high confidence and whose right-hand side does not belong
to a set of very frequent attributes $F$: $\tau_{e(r_0).c > \ldots \wedge e(r_0).rhs \notin F}(idb) = (r_0, s_1)$.
The intuition is that $rhs$ denotes the right-hand side of an association rule. The
rules in $s_1$ are not all frequent (no frequency constraint) but have a rather high
confidence while their right-hand sides are not very frequent. Indeed, computing
infrequent rules will in practice be intractable unless other constraints can help
to reduce the search space (and are used for that during the mining process). □

Footnote: Selection of tuples and selection of patterns are denoted by $\sigma$ and $\tau$,
respectively. As it is always clear from the context, these operations are also applied
to inductive database instances, although formally we should introduce new notations
for them.

The concept of exceptional data w.r.t. a pattern or a set of patterns is
interesting in practice. So, in addition to the normal algebraic operations, let us
introduce the so-called apply operation, denoted by $\alpha$, which makes it possible to
cross the boundary between data and patterns by removing tuples from the data set
so that all the patterns are true in the new collection of tuples.

In the case of association rules, assume the following definition: a pattern $\theta$ is
false in the tuple $t$ if its left-hand side holds while its right-hand side does not hold;
in the other cases a pattern is true. In other terms, an association rule $\theta$ is true
in a tuple $t \in r$ iff $e(\{t\}, \theta).f = e(\{t\}, \theta).c = 1$. Let us define $\alpha((r, s)) = (r', s)$
where $r'$ is the greatest subset of $r$ such that $\forall \theta \in s$, $e(r', \theta).c = 1$. Note that
$r \setminus r'$ is the collection of tuples that are exceptions w.r.t. the patterns in $s$.

Example 5 Continuing Example 2, assume the instance $(r_0, s_4)$ where $s_4$
contains the rule $AC \Rightarrow B$ with frequency 0.25 and confidence 0.5. Let $\alpha((r_0, s_4)) =
(r_4, s_4)$. Only the tuple $(1, 0, 1)$ is removed from $r_0$ since the rule $AC \Rightarrow B$ is
true in the other ones. The pattern $AC \Rightarrow B$ remains the unique pattern ($s_4$
is unchanged) though its frequency and confidence in $r_4$ are now 0.33 and 1,
respectively. □
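
The apply operation can be sketched as follows. The four-row instance is an assumption: it was chosen so that it reproduces the figures quoted in Example 5, and it is not claimed to be the instance printed in Table 1. The helper names (`apply_op`, `freq_conf`) are likewise hypothetical.

```python
from fractions import Fraction

# Hypothetical instance over R = {A, B, C}, chosen to match the numbers of Example 5.
r0 = [
    {"A": 1, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
]

def is_false_in(t, lhs, rhs):
    """A rule is false in t when its left-hand side holds and its right-hand side does not."""
    return all(t[a] == 1 for a in lhs) and t[rhs] != 1

def apply_op(rows, rules):
    """Split rows into those where every rule is true and the exceptions."""
    kept, exceptions = [], []
    for t in rows:
        violated = any(is_false_in(t, lhs, rhs) for lhs, rhs in rules)
        (exceptions if violated else kept).append(t)
    return kept, exceptions

def freq_conf(rows, lhs, rhs):
    """Frequency and confidence of lhs => rhs in rows (confidence None if lhs never holds)."""
    n_lhs = sum(1 for t in rows if all(t[a] == 1 for a in lhs))
    n_both = sum(1 for t in rows if all(t[a] == 1 for a in lhs) and t[rhs] == 1)
    return Fraction(n_both, len(rows)), (Fraction(n_both, n_lhs) if n_lhs else None)

s4 = [(("A", "C"), "B")]            # the single rule AC => B
r4, exceptions = apply_op(r0, s4)   # exceptions holds the tuple (1, 0, 1)
```

Removing exactly the tuples in which some rule is false yields the greatest subset on which every rule has confidence 1, which is why a single pass over the data suffices here.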

2.2 Generalization to Other Pattern Types 

The formal definition we gave is very general. In this section, we first consider
another example of a data mining task where inductive database concepts can be
illustrated. We also point out crucial issues for query evaluation.

One typical KDD process we studied is the discovery of approximate inclusion 
and functional dependencies in a relational database. It can be useful either for 
debugging purposes, semantic query optimization or even reverse engineering 
[3]. We suppose that the reader is familiar with data dependencies in relational 
databases. 

Example 6 Assume $R = \{A, B, C, D\}$ and $S = \{E, F, G\}$ with the two
following instances in which, among others, $S[\langle G \rangle] \subseteq R[\langle A \rangle]$ is an inclusion
dependency and $AB \rightarrow C$ a functional dependency (see Table 2(a-b)). □

Dependencies that almost hold are interesting: it is possible to define natural
error measures for inclusion dependencies and functional dependencies. For
instance, let us consider an error measure for an inclusion dependency $R[X] \subseteq S[Y]$
in $r$ that gives the proportion of tuples that must be removed from $r$, the instance
of $R$, to get a true dependency. With the same idea, let us consider an error
measure for functional dependencies that gives the minimum number of rows that need
to be removed from the instance $r$ of $R$ for a dependency $R : X \rightarrow B$ to hold.
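
A minimal sketch of the two error measures follows. It assumes that the functional dependency error is normalised by the number of rows, as the fractional values reported for Table 2(c) suggest; the toy instances and the function names are hypothetical and are not the instances of Table 2.

```python
from collections import Counter, defaultdict

def ind_error(r, X, s, Y):
    """Proportion of rows of r to remove so that r[X] is included in s[Y]."""
    targets = {tuple(t[a] for a in Y) for t in s}
    bad = sum(1 for t in r if tuple(t[a] for a in X) not in targets)
    return bad / len(r)

def fd_error(r, X, B):
    """Minimum proportion of rows of r to remove so that X -> B holds.

    Within each group of rows agreeing on X, only the rows carrying the most
    common B value can be kept.
    """
    groups = defaultdict(Counter)
    for t in r:
        groups[tuple(t[a] for a in X)][t[B]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(r) - kept) / len(r)

# Hypothetical toy instances (not taken from the paper).
r = [{"A": 1, "B": 1}, {"A": 2, "B": 1}, {"A": 3, "B": 2}, {"A": 3, "B": 3}]
s = [{"E": 1}, {"E": 2}]

e_ind = ind_error(r, ["B"], s, ["E"])  # one of four rows has B = 3, absent from s[E]
e_fd = fd_error(r, ["B"], "A")         # rows with B = 1 disagree on A
```

An error of 0 means that the dependency holds exactly in the given instance.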



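Example 7 Continuing Example 6, a few approximate inclusion and functional
dependencies are given (see Table 2(c)). □

(a) Instance $r$ of $R = \{A, B, C, D\}$ and (b) instance $s$ of $S = \{E, F, G\}$:
[cell values not recoverable from the source]

(c)
Inclusion dependencies                                          Error   Functional dependencies   Error
$R[\langle B \rangle] \subseteq S[\langle E \rangle]$           0       $B \rightarrow A$          0.5
$R[\langle D \rangle] \subseteq S[\langle E \rangle]$           0.25    $C \rightarrow A$          0.25
$S[\langle E \rangle] \subseteq R[\langle B \rangle]$           0.33    $BC \rightarrow A$         0.25
$R[\langle C, D \rangle] \subseteq S[\langle E, F \rangle]$     0.25    $BCD \rightarrow A$        0.25

Table 2. Tables for Examples 6 and 7.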



It is now possible to consider the two inductive databases that associate to a 
database all the inclusion dependencies and functional dependencies that can be 
built from its schema. Evaluation functions return the respective error measures. 
When the error is null, it means that the dependency holds. Indeed, here again it 
is not realistic to consider that querying can be carried out by means of queries 
over some materializations of all the dependencies that almost hold. 

Example 8 Continuing again Example 6, a user might be interested in
“selecting” only inclusion dependencies between instances $r$ and $s$ that do not involve
attribute $R.A$ in their left-hand side and have an error measure lower than 0.3.
One expects that a sentence like $R[\langle C, D \rangle] \subseteq S[\langle E, F \rangle]$ belongs to the answer.
The “apply” operation can be used to get the tuples that are involved in the
dependency violation. One can now search for functional dependencies in $s$ whose
left-hand sides are a right-hand side of a previously discovered inclusion
dependency. For instance, we expect that a sentence like $EF \rightarrow G$ belongs to the answer.
Evaluating this kind of query provides information about potential foreign keys
between $R = \{A, B, C, D\}$ and $S = \{E, F, G\}$. □

Query evaluation We already noticed that object-relational query languages can
be used as a basis for inductive database query languages. However, non-classical
optimization schemes are needed since selections of properties lead to complex
data mining phases. Indeed, implementing such query languages is difficult
because selections of properties are not performed over previously materialized
collections. First, one must know efficient algorithms to compute collections of
patterns and to evaluate the evaluation function on very large data sets. But the
most challenging issue is the formal study of selection language properties for
general classes of patterns: given a data set and a potentially infinite collection
of patterns, how can we actively use the selection criteria to optimize
the generation/evaluation of the relevant patterns?

Example 9 When mining association rules that do not involve a given
attribute, instead of computing all the association rules and then eliminating those which
contain that attribute, one can directly eliminate that attribute during the
candidate generation phase for frequent sets discovery. Notice that such a simple
trick cannot be used if the given attribute must be avoided in the left-hand side
only. □
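
The trick of Example 9 can be illustrated with a small level-wise frequent set miner. This is a sketch under stated assumptions: the relation, the support threshold and the function name `frequent_sets` are hypothetical, and the miner is a plain apriori-style search rather than any of the optimized algorithms cited in this section.

```python
from itertools import combinations

def frequent_sets(rows, attributes, min_support):
    """Level-wise search for attribute sets whose support reaches min_support."""
    n = len(rows)

    def supp(W):
        return sum(1 for t in rows if all(t[a] == 1 for a in W)) / n

    level = [frozenset([a]) for a in attributes if supp([a]) >= min_support]
    result = {}
    while level:
        for W in level:
            result[W] = supp(W)
        current = set(level)
        k = len(level[0]) + 1
        # Candidates of size k whose (k-1)-subsets are all frequent (apriori pruning).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [W for W in candidates
                 if all(frozenset(sub) in current for sub in combinations(W, k - 1))
                 and supp(W) >= min_support]
    return result

rows = [
    {"A": 1, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
]
attrs = ["A", "B", "C"]
excluded = "B"

# Constraint pushed into candidate generation: B never becomes a candidate.
pushed = frequent_sets(rows, [a for a in attrs if a != excluded], 0.5)

# Naive alternative: mine everything, then discard the sets that contain B.
filtered = {W: v for W, v in frequent_sets(rows, attrs, 0.5).items()
            if excluded not in W}
```

Both strategies return the same sets, but the first one never counts the support of a set containing the excluded attribute. The equivalence relies on the fact that every subset of a set without the attribute also lacks it, which is exactly what fails when the attribute is forbidden in the left-hand side only.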

The complexity of mining frequent association rules mainly consists of finding
frequent sets. Given boolean constraints over attributes, [16] show how
to optimize the generation of frequent sets by using this kind of constraint during
the generation/evaluation process. This approach has been considerably extended
in [14]. Other interesting ideas come from the generalization of the apriori
trick, and can be found in different approaches such as [6] or [17]. [6] propose an
algorithm that generalizes the apriori trick to the context of frequent atomsets.
This typical inductive logic programming tool makes it possible to mine association
rules from multiple relations. [17] consider query flocks, which are parametrized
Datalog queries for which a selection criterion on the result of the queries must hold.
When the filter condition is related to the frequency of answers and queries are
conjunctive queries augmented with arithmetic and union, they can propose an
optimizing scheme. In the general framework, three important questions arise:

1. How to evaluate a class of similar patterns faster than by looking at each of 
them individually? An explicit evaluation of all the patterns of the schema 
against the database (and all databases resulting from it by queries) is not 
feasible for large data sets. Safe pruning criteria have to be found. 

2. How to evaluate patterns without looking at the whole data set? This is an
important issue to reduce the dimensionality of the mining task, e.g., via sampling.
In some cases, it might also be possible not to use the data set at all and to perform a
simple selection over a previously materialized collection of patterns or a more
or less condensed representation [11].

3. How to evaluate operation sequences, e.g., in replays, more efficiently?
Compiling schemes can be defined for this purpose. For instance, a crucial issue
is the study of pattern selection commutativity for useful classes of
patterns. The formal study of selection criteria for pattern classes that are more
complex than frequent sets is to be done.

A framework for object-oriented query optimization when using expensive 
methods [7] can also serve as a basis for optimization strategies. 

3 Inductive Databases and KDD Processes 

Already in the case of a unique class of patterns, real-life mining processes are 
complex. This is due to the dynamic nature of knowledge acquisition, where 
gathered knowledge often affects the search process, giving rise to new goals in 
addition to the original ones. 

In the following, we introduce a scenario about telecommunication networks
fault analysis using association rules. It is a simplified problem of knowledge
discovery to support off-line network surveillance, where a network manager tries
to identify and correct faults based on sent alarms. A comprehensive discussion
on this application is available in [10].

Assume that the schema for the data part is $R$ = (alarm type, alarming
element, element type, date, time, week, alarm severity, alarm text). We consider
items as equalities between attributes and values, while rule left-hand and
right-hand sides are sets of items. Notice also that we use in the selection conditions
expressions that concern subcomponents of the rules. Typically, one wants to
select rules with a given attribute on the left-hand side (LHS) or on the right-hand
side (RHS), or give bounds to the number of occurring items. Self-explanatory
notations are used for this purpose. A sample of an instance of this schema is
given in Table 3.

[Upper table, data part $r_0$: only the fragments "4", "2", "LINK_FAILURE" and "50"
survive in the source; the rows are not recoverable.]

Rule                                                                        e(r_0).f   e(r_0).c
alarm_type=1111 $\Rightarrow$ element_type=ABC                              0.25       1.00
alarm_type=222 $\Rightarrow$ alarming_element=E2, element_type=CDE          0.25       1.00
alarm_type=1111, element_type=ABC $\Rightarrow$ alarm_text=LINK_FAILURE     0.25       1.00
alarm_type=5555 $\Rightarrow$ alarm_severity=1                              0.00       0.00

Table 3. Part of an inductive database consisting of data part $r_0$ (upper table) and
rule part $s_0$ (lower table).

Scenario The network manager decides to look at association rules derived from
$r_0$, the data set for the current month. Therefore, he/she “tunes” parameters for
the search by pruning out all rules that have confidence under 5% or frequency
under 0.05% or more than 10 items (phase 1 in Table 4). The network manager
then considers that attributes “alarm text” and “time” are not interesting, and
projects them away (phase 2). The number of rules in the resulting rule set,
$s_2$, is still quite large. The user decides to focus on the rules from week 30
and to restrict to 5 the maximum number of items in a rule (phase 3). While
browsing the collection of rules $s_3$, the network manager sees that a lot of rules
concern the network element E. That reminds him/her of a maintenance operation
and he/she decides to remove all rules that contain “alarming element = E or its
subcomponent” (phase 4). We omit the explanation of dealing with the taxonomy
of components. The resulting set of rules seems not to show anything special.
So, the network manager decides to compare the behavior of the network to the
preceding similar period (week 29) and find out possible differences (phases 5-6).
The network manager then picks up one rule, $s_8$, that looks interesting and is
very strong (confidence is close to 1), and he/she wants to find all exceptions to this
rule, i.e., rows where the rule does not hold (phases 7-8).

Except for the last phases, the operations are quite straightforward. In the
comparison operation, however, we must first replay the phases 3-4. This is
because we have to remove the field “week” from the schema we used in creating
rules for week 30, so that we can compare these rules with the rules from week
29. Then we create for week 29 the same query (except for the week information),
take the intersection of these two rule sets, and calculate the frequencies and
confidences of the rules in the intersection. The search for exceptions is performed
using the apply operation introduced in Section 2.

This simple scenario illustrates a typical real-life data mining task. Due to
the closure property, KDD processes can be described by sequences of
operations, i.e., queries over relevant inductive databases. In fact, such sequences of
queries are abstract and concise descriptions of data mining processes. An
interesting point here is that these descriptions can even be annotated by statistical
information about the size of the selected data set, the size of intermediate collections
of patterns etc., providing knowledge for further use of these sequences.

Phase  Operation             Query and conditions
1      Selection             $\tau_{F_1}((r_0, s_0)) = (r_0, s_1)$
                             $F_1 = e(r_0).f > 0.005 \wedge e(r_0).c > 0.05 \wedge |LHS| < 10$
2      Projection            $\pi_T((r_0, s_1)) = (r_1, s_2)$
                             $T = R \setminus \{$alarm text, time$\}$
3      Selection             $\tau_{F_2}(\sigma_{C_1}((r_1, s_2))) = (r_2, s_3)$
                             $C_1 = ($week $= 30)$ and $F_2 = |LHS \cup RHS| < 5$
4      Selection             $\tau_{F_3}((r_2, s_3)) = (r_2, s_4)$
                             $F_3 = ($alarming element $= E*) \notin \{LHS \cup RHS\}$
5      Replay 3-4 (week 30)  $\tau_{F_3}(\tau_{F_2}(\pi_U(\sigma_{C_1}((r_1, s_2))))) = (r_3, s_5)$
                             $U = T \setminus \{$week$\}$, other conditions as in 3-4
6      Replay 3-4 (week 29)  $\tau_{F_3}(\tau_{F_2}(\pi_U(\sigma_{C_2}((r_1, s_2))))) = (r_4, s_6)$
                             $C_2 = ($week $= 29)$, other conditions as in 5
7      Intersection          $\cap((r_3, s_5), (r_4, s_6)) = (r_3, s_7)$
8      Apply                 $\alpha((r_3, s_8)) = (r_5, s_9)$

Table 4. Summary of the phases of the experiment.
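
The closure property can be made concrete with a small sketch in which a pattern selection maps an inductive database instance to another instance, so that selections compose. The rule values are those legible in the rule part of Table 3; the data part is left empty because it is not needed for pattern selections. The names `Rule`, `tau`, `F1` and `F3` are assumptions of this sketch, and treating "E or its subcomponent" as a name prefix test is a simplification that ignores the taxonomy of components.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    lhs: frozenset  # items such as "alarm_type=1111"
    rhs: frozenset
    f: float        # frequency in the data part
    c: float        # confidence in the data part

def tau(pred, idb):
    """Selection on patterns: the data part is kept, the rule part is filtered."""
    r, s = idb
    return (r, [rule for rule in s if pred(rule)])

def F1(rule):
    """Phase 1 condition of Table 4."""
    return rule.f > 0.005 and rule.c > 0.05 and len(rule.lhs) < 10

def F3(rule):
    """Phase 4 condition, approximated: no item on alarming element E or E-prefixed names."""
    return not any(item.startswith("alarming_element=E")
                   for item in rule.lhs | rule.rhs)

s0 = [
    Rule(frozenset({"alarm_type=1111"}), frozenset({"element_type=ABC"}), 0.25, 1.0),
    Rule(frozenset({"alarm_type=222"}),
         frozenset({"alarming_element=E2", "element_type=CDE"}), 0.25, 1.0),
    Rule(frozenset({"alarm_type=1111", "element_type=ABC"}),
         frozenset({"alarm_text=LINK_FAILURE"}), 0.25, 1.0),
    Rule(frozenset({"alarm_type=5555"}), frozenset({"alarm_severity=1"}), 0.0, 0.0),
]
idb = ([], s0)  # data part omitted in this sketch

# Because tau returns an instance again, the phases can be chained.
r, s = tau(F3, tau(F1, idb))
```

On these four rules the two selections commute, which is the kind of property a compiling scheme for replays would try to exploit.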

4 Conclusions and Future Work 

We presented a framework for inductive databases considering that the whole
process of data mining can be viewed as a querying activity. Our simple
formalization of operations enables the definition of mining processes as sequences of
queries, thanks to a closure property. The description of a non-trivial mining
process using these operations has been given and, even if no concrete query
language or query evaluation strategy is available yet, it is a mandatory step towards
general purpose query languages for KDD applications.

Query languages like M-SQL [9] or MINE RULE [13] are good candidates for
inductive database querying though they are dedicated to boolean and association
rule mining, respectively. A simple Pattern Discovery Algebra has been proposed
in [18]. It supports pattern generation, pattern filtering and pattern combining
operations. This algebra allows the user to specify discovery strategies, e.g., using
different criteria of interestingness, but at a macroscopic level; implementation
issues or added value for supporting the mining step are not considered.

We introduced, as an example, an inductive database for association rules,
and gave a realistic scenario using simple operations. It appears that, without
introducing any additional concepts, standard database terminology makes it possible to
carry out inductive database querying and that recent contributions to query
optimization techniques can be used for inductive database implementation. A
significant question is whether the inductive database framework is interesting
for a reasonable collection of data mining problems. We are currently studying KDD
processes that need different classes of patterns.
Abstract. In this paper, we discuss mining with respect to web data, referred to here as
web data mining. In particular, our focus is on web data mining research in the context of
our web warehousing project called WHOWEDA (Warehouse of Web Data). We have
categorized web data mining into three areas: web content mining, web structure
mining and web usage mining. We have highlighted and discussed various research
issues involved in each of these web data mining categories. We believe that web data
mining will be a topic of exploratory research in the near future.

1 Introduction 

Most users obtain WWW information using a combination of search engines and
browsers; however, these two types of retrieval mechanisms do not necessarily address
all of a user’s information needs. Recent studies provide a comprehensive and
comparative evaluation of the most popular search engines [1] and WWW database
[15]. The resulting growth in on-line information combined with the almost
unstructured web data necessitates the development of powerful yet computationally
efficient web data mining tools. Web data mining can be defined as the discovery and
analysis of useful information from WWW data. The web involves three types of data:
data on the WWW, the web log data regarding the users who browsed the web pages,
and the web structure data. Thus, WWW data mining should focus on three issues:
web structure mining, web content mining [6] and web usage mining [2,8,10]. Web
structure mining involves mining the structures and links of web documents. In [16],
some insight is given on mining structural information on the web. Our initial study
[3] has shown that web structure mining is very useful in generating information such as
visible web documents, luminous web documents and luminous paths (a luminous path
is a path common to most of the results returned). In this paper, we discuss some
applications in web data mining and E-commerce where we can use these types of
knowledge. Web content mining describes the automatic search of information
resources available on-line. Web usage mining includes the data from server access
logs, user registration or profiles, user sessions or transactions, etc. A survey of some
of the emerging tools and techniques for web usage mining is given in [2]. In our
discussion here, we focus on the research issues in web data mining with respect to
WHOWEDA [5,7].

2 WHOWEDA 

In WHOWEDA (Warehouse of Web Data), we introduced our web data model. It
consists of a hierarchy of web objects. The fundamental objects are Nodes and Links,
where nodes correspond to HTML text documents and links correspond to hyperlinks
interconnecting the documents in the WWW. These objects consist of a set of
attributes as follows: Nodes = [url, title, format, size, date, text] and Links = [source-url,
target-url, label, link-type]. We materialize web data as web tuples representing
directed connected graphs, comprised of web objects (Nodes and Links). We
associate with each web table a web schema that binds a set of web tuples in a web
table using meta-data in the form of connectivities and predicates defined on node and
link variables. Connectivities represent structural properties of web tuples by
describing possible paths between node variables. Predicates, on the other hand, specify
the additional conditions that must be satisfied by each tuple to be included in the web
table. In the Web Information Coupling System (WICS) [7], a user expresses a web query
in the form of a query graph consisting of some nodes and links representing web
documents and hyperlinks in those documents, respectively. Each of these nodes and
links can have some keywords imposed on them to represent those web documents
that contain the given keywords in the documents and/or hyperlinks. When the query
graph is posted over the WWW, a set of web tuples, each satisfying the query graph,
is harnessed from the WWW. Thus, the web schema of a table resembles the query
graph used to derive the web tuples stored in the web table. Some nodes and links in the
query graph may not have keywords imposed, and are called unbound nodes and links.
[Figure: query graph. The start node is http://www.cs.stanford.edu/people/faculty.html; the link label is "publications" and the last node carries the keyword "data mining".]

Consider a query to find all data mining related publications by the CS
faculty starting with the web page http://www.cs.stanford.edu/people/faculty.html. The
query may be expressed as the query graph shown above. This query graph is assigned as
schema to the web table generated in response to the query. The schema corresponding
to the query graph can be formally expressed as $\langle X_n, X_l, C, P \rangle$ where $X_n$ is the
set of node variables ($x$, $y$, $z$ in the example above), $X_l$ is the set of link variables
($-$ (unbound link) and $e$ in the example), $C$ is a set of connectivities $k_1 \wedge k_2$ where
$k_1 = x \langle - \rangle y$ and $k_2 = y \langle e \rangle z$, and $P$ is a set of predicates $p_1 \wedge p_2 \wedge p_3 \wedge p_4$ such that
$p_1(x)$ = [x.url EQUALS http://www.cs.stanford.edu/people/faculty.html], $p_2(e)$ =
[e.label CONTAINS "publications"], $p_3(y)$ = [y.text CONTAINS "AI or database"], $p_4(z)$
= [z.text CONTAINS "data mining"]. The query returns all web tuples satisfying the
web schema given above. These web tuples contain the faculty page, the faculty
member’s page, which should contain words such as "AI or database", and the respective
publications page if it contains the words "data mining". Thus, many instances of the
query graph shown above will be returned as web tuples. We show one instance
of the above query graph below.




[Figure: one instance of the query graph, starting at http://www.cs.stanford.edu/people/faculty.html.]
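
To make the role of connectivities and predicates concrete, the following sketch checks whether a candidate web tuple satisfies the schema of the example. It is only an illustration: the classes `Node` and `Link`, the function `satisfies`, and the example.org URLs are hypothetical, and the predicate $p_3$ is read here as "the text contains AI or contains database", which is an assumption about the intended meaning.

```python
from dataclasses import dataclass

@dataclass
class Node:
    url: str
    title: str = ""
    text: str = ""

@dataclass
class Link:
    source_url: str
    target_url: str
    label: str = ""

START = "http://www.cs.stanford.edu/people/faculty.html"

def satisfies(x, y, z, link_xy, e):
    """Check connectivities k1, k2 and predicates p1..p4 on one candidate web tuple."""
    k1 = link_xy.source_url == x.url and link_xy.target_url == y.url  # x<->y (unbound link)
    k2 = e.source_url == y.url and e.target_url == z.url              # y<e>z
    p1 = x.url == START
    p2 = "publications" in e.label
    p3 = "AI" in y.text or "database" in y.text
    p4 = "data mining" in z.text
    return k1 and k2 and p1 and p2 and p3 and p4

# Hypothetical instance of the query graph.
x = Node(START, title="Faculty")
y = Node("http://www.example.org/~smith/", text="Research interests: database systems")
z = Node("http://www.example.org/~smith/pubs.html", text="Papers on data mining")
link_xy = Link(x.url, y.url)
e = Link(y.url, z.url, label="publications")

ok = satisfies(x, y, z, link_xy, e)
```

A web table for this schema would then hold every tuple (x, y, z, link_xy, e) for which the check succeeds.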
3 Web Structure Mining

Web structure mining aims to generate a structural summary of web sites and web
pages. The focus of structure mining is therefore on link information, which is an
important aspect of web data. Given a collection of interconnected web documents,
interesting and informative facts describing their connectivity in the web subset can be
discovered. We are interested in generating the following structural information from
the web tuples stored in the web tables.

• Measuring the frequency of the local links in the web tuples in a web table. Local
links connect different web documents residing on the same server. This identifies
the web tuples (connected documents) in the web table that carry more
information about inter-related documents existing on the same server. It also
measures the completeness of web sites, in the sense that most of the closely
related information is available at the same site. For example, an airline’s home
page will have more local links connecting the “routing information with air-fares
and schedules” than external links.

• Measuring the frequency of web tuples in a web table containing links which are
interior, i.e., links which stay within the same document. This measures a web
document’s ability to cross-reference other related parts within the same
document.

• Measuring the frequency of web tuples in a web table that contain links that are
global, i.e., links which span different web sites. This measures the visibility of the web
documents and the ability to relate similar or related documents across different sites.
For example, research documents related to “semi-structured data” will be available
at many sites and such sites should be visible to other related sites by providing
cross references under popular phrases such as “more related links”.

• Measuring the frequency of identical web tuples that appear in a web table or 
among the web tables. This measures the replication of web documents across the 
web warehouse and may help in identifying, for example, the mirrored sites. 

• On average, we may need to find how many web tuples are returned in response to 
a query on some popular phrases such as “Bio-science” with respect to queries 
containing keywords like “earth-science”. This can give an estimation of the results 
returned in response to some popular queries. 

• Another interesting issue is to discover the nature of the hierarchy or network of
hyperlinks in the web sites of a particular domain. For example, with respect to URLs
with domains like .edu, one would like to know how most of the web sites are
designed with respect to information flow in educational institutes. What is the flow
of the information they provide and how are they related conceptually? Is it possible
to extract conceptual hierarchical information for designing web sites of a
particular domain? This will help, for example, in building a common web schema or
wrappers for educational institutes, and can thus make query processing easier.
• What is the in-degree and out-degree of each node (web document)? What is the
meaning of high and low in- and out-degrees? For example, a high in-degree may
be a sign of a very popular web site or document. Similarly, a high out-degree may
be a sign of a luminous web site. Out-degree also measures a site’s connectivity.

• If a web page is directly linked to another web page, or the two pages are near to
each other, then we would like to discover the relationships among those web pages.
The two web pages might be related by synonyms or ontology, or have similar topics;
if both web pages are on the same server, they may have been authored
by the same person.
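
Several of the measures listed above reduce to simple counts over the links of a web table. The sketch below classifies links as interior, local or global and computes in- and out-degrees. The URLs are hypothetical, and two choices are assumptions of this sketch: the host name stands for "the same server", and interior links are ignored when counting degrees.

```python
from collections import Counter
from urllib.parse import urldefrag, urlsplit

def link_kind(source_url, target_url):
    """Classify a link as interior (same document), local (same server) or global."""
    s, t = urlsplit(source_url), urlsplit(target_url)
    if (s.netloc, s.path) == (t.netloc, t.path):
        return "interior"
    if s.netloc == t.netloc:
        return "local"
    return "global"

# Hypothetical links (source, target) collected from the web tuples of a web table.
links = [
    ("http://a.example/index.html", "http://a.example/fares.html"),
    ("http://a.example/index.html", "http://a.example/index.html#top"),
    ("http://a.example/fares.html", "http://b.example/"),
    ("http://c.example/", "http://a.example/fares.html"),
]

kinds = Counter(link_kind(src, dst) for src, dst in links)

# Degrees per document, fragments stripped, interior links left out.
in_degree, out_degree = Counter(), Counter()
for src, dst in links:
    if link_kind(src, dst) != "interior":
        out_degree[urldefrag(src)[0]] += 1
        in_degree[urldefrag(dst)[0]] += 1
```

Dividing each count in `kinds` by the number of links gives the frequencies mentioned in the first three items of the list.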

While the above information is discovered at the inter-document level, web
structure mining can also take another direction - discovering the structure of web
documents themselves. Web document structure mining can be used to reveal the
structure (schema) of web pages. This would be useful for navigational purposes,
and several other operations such as comparing and integrating web page schemes
would become possible. This type of structure mining would facilitate web document
classification and clustering on the basis of structure. It will also contribute towards
introducing database techniques for accessing information in web pages by providing a
reference schema. Related work on schema discovery of semi-structured documents
includes [11,12] and is similar to the approach of using representative objects in [14].
Another work [13] derives a type hierarchy using measures similar to support and
confidence encountered earlier, to represent the inherent structure of large collections
of semi-structured data.

3.1 Web Bags 

Most of the search engines fail to handle the following knowledge discovery goals: 

• From the query result returned by search engines, a user may wish to locate the
most visible web sites [3,4] or documents for reference, that is, sites or documents
that can be reached by many paths (high fan-in).

• Reversing the concept of visibility, a user may wish to locate the most luminous
web sites [3,4] or documents for reference, that is, web sites or documents which
have the largest number of outgoing links.

• Furthermore, a user may wish to find out the most traversed path for a particular 
query result. This is important since it helps the user to identify the set of most 
popular interlinked web documents that have been traversed frequently to obtain the 
query result. 

We have defined a concept of a web bag in [3] and used web bags for the types of 
the knowledge discovery discussed above. Informally, a web bag is a web table 
containing multiple occurrences of identical web tuples. Note that a web tuple is a set 
of inter-linked documents retrieved from the WWW that satisfies a query graph. A 
web bag may only be created by projecting some of the nodes from web tuples of a 
web table using the web project operator. A web project operator is used to isolate the
data of interest, allowing subsequent queries to run over smaller, perhaps more
structured web data. Unlike its relational counterpart, a web project operator does not
eliminate identical web tuples automatically. Thus, the projected web table may
contain identical web tuples (i.e., it may be a web bag).
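
The following sketch shows how projection without duplicate elimination produces a web bag. Web tuples are modelled as dictionaries from node variables to URLs, which is an assumption made for illustration; `web_project` is a hypothetical name and not the actual operator of the warehouse system.

```python
from collections import Counter

def web_project(web_table, node_vars):
    """Keep only the given node variables of each web tuple; duplicates are kept."""
    return [tuple(t[v] for v in node_vars) for t in web_table]

# Hypothetical web table with node variables x, y, z.
web_table = [
    {"x": "http://s.example/", "y": "http://s.example/a", "z": "http://t.example/p"},
    {"x": "http://s.example/", "y": "http://s.example/b", "z": "http://t.example/p"},
    {"x": "http://s.example/", "y": "http://s.example/c", "z": "http://t.example/q"},
]

bag = web_project(web_table, ["x", "z"])  # y is projected away
multiplicity = Counter(bag)               # how often each projected tuple occurs
```

The multiplicities are exactly the information a set-based projection would discard, and they are what the knowledge discovery tasks discussed here rely on.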

Using web bags, we discover visible web documents, luminous web documents and
luminous paths [3]. Below we define the three types of knowledge. Then we discuss
the applications of these three types of knowledge, on which we are currently working.

Visibility of Web Documents : The visibility of a web document D in a web table W
measures the number of different web documents in W that have links to D. We call
such documents visible since they are linked by a
large number of distinct nodes in the web table. The significance of a visible node D is that the
document D is relatively more important than other documents or nodes in W
for the given query. In a web table, each node variable may have a set of visible nodes.
Not all of these may be useful to the user. Thus, we explicitly specify a threshold
value to control the search for visible nodes. The visibility threshold indicates that
there should exist at least some reasonably substantial evidence of the visibility of
instances of the specified node variable in the web table to warrant the presentation of
visible nodes. As an application, consider a query graph involving keywords
such as "types of restaurants" and "items" given below, where dotted lines denote
unbound nodes and links. We assume that there is a site on the WWW which provides
a list of types of restaurants (e.g., Italian, Asian, etc.), each of which leads to the names of
the restaurants of that type. We also assume that there is a web site which provides a list of items
for all types of restaurants.

[Figure: query graph linking www.test.com to "items"] 




In our web warehouse system, the results returned for a query graph imposing such 
predicates are the instances of restaurants selling different items. 
For example, the three web tuples corresponding to the query graph are as given 
below. 



[Figure: web tuples linking www.test.com to item pages such as Pizza] 




From the results returned, we can find the most visible web pages by providing a very 
high visibility threshold (see [3] for further details). Assume that this gives Z1 as the 
most visible web page (the one with the most incoming links from different URLs), which 
has details about pizza. This gives an estimate of the number of different restaurants 
which sell pizza. By lowering the visibility threshold, we can get another set of visible 
web pages; assume that this time we get the set {Z1, Z2}, where Z2 is an instance of 
a web page which provides details of pasta. Note that some 
restaurants may sell both pizza and pasta. By comparing the sets of different URLs 
corresponding to the restaurants, we can derive association rules such as "of the 80% 
of restaurants which offer pizza to their customers, 40% also provide pasta". Further, 
we can cluster (group) these restaurants according to type and generate rules such as 
"of the 80% of restaurants which sell pizza, the 40% which also sell pasta are of the 
Italian type". 
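
The visibility measure and the effect of the threshold can be sketched as follows. This is a minimal illustration, not the algorithm of [3]: it assumes a web tuple is a list of (source, target) links, it ignores node variables, and the restaurant identifiers `r1`, `r2`, `r3` are invented.

```python
from collections import defaultdict

def visibility(web_table):
    """Map each document to the number of DISTINCT documents that link to it."""
    sources = defaultdict(set)
    for web_tuple in web_table:
        for src, dst in web_tuple:
            sources[dst].add(src)
    return {doc: len(srcs) for doc, srcs in sources.items()}

def visible_nodes(web_table, threshold):
    """Documents whose visibility meets or exceeds the threshold."""
    return {doc for doc, v in visibility(web_table).items() if v >= threshold}

# Hypothetical web table: restaurants r1..r3 link to item pages Z1 (pizza), Z2 (pasta).
web_table = [
    [("www.test.com", "r1"), ("r1", "Z1")],
    [("www.test.com", "r2"), ("r2", "Z1")],
    [("www.test.com", "r3"), ("r3", "Z1")],
    [("www.test.com", "r1"), ("r1", "Z2")],
    [("www.test.com", "r2"), ("r2", "Z2")],
]

most_visible = visible_nodes(web_table, threshold=3)   # only Z1
less_strict = visible_nodes(web_table, threshold=2)    # Z1 and Z2
```

Because the sources are collected in a set, repeated occurrences of the same link in a web bag do not inflate the visibility of a document.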

Consider another example where a new business venture wants to analyze 
its web site, which displays products for sale. By finding the visibility of its 
web site with respect to other web sites selling such (or related) products, the company 
can find ways to redesign its web site (including changes to product prices, etc.) to 
improve visibility. For example, web sites which sell PC monitors are likely to 
provide links to web sites which sell CPUs. Thus, if a web site finds that its visibility 
is lower than that of other web sites selling CPUs, then the web site needs to 
improve in terms of design, products, etc. 

Luminosity of Web Documents: Reversing the concept of visibility, the luminosity of 
a web document D in a web table W measures the number of outgoing links, i.e., the 
number of other distinct web documents in W that are linked from D. As in the 
determination of visible nodes, we explicitly specify the node variable y based on 
which luminous nodes are to be discovered, together with the luminosity threshold. As an 
application, one can use the luminosity of a web site displaying a particular product or 
a set of products to identify the companies that make all those products. This will give 
an estimate of the kind "a company that makes a product A also makes a set 
of products B and C". Note that a company can make products B and/or C 
without necessarily making product A, and it may be that only a certain percentage of 
companies follow such rules. We want to generate association rules such as "of the X% 
of all the electric companies which make a product A, Y% also make the set 
of other products B and C" (support). We can also generate rules of the kind "whenever 
a company makes a product, it also makes certain other products", for example, "X% of 
companies which make a product A also make products B and C" (confidence). 
Such rules help a new electric company to decide which set of products 
it should start manufacturing together. 

Consider the following web tuples in a web table. 
[Figure: web tuples linking www.eleccompany.com to www.elecproduct.org/productA, 
www.elecproduct.org/productB (Product B) and www.elecproduct.org/productC 
(Product C), and a web tuple linking company C to product A] 


Note that in the above example, certain companies (20%), if they make a product A, also 
make products B and C. However, company C makes only product A. That is, 
of the 40% of companies which make a product A, 20% also make products B and 
C. 
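
Luminosity and the rule statistics can be sketched in the same style. The sketch below uses the standard definitions of support and confidence, which is an assumption about how the text intends the two terms; the company identifiers `c1` to `c5` and the function names are hypothetical.

```python
from collections import defaultdict

def outgoing_links(web_table):
    """Map each document to the set of distinct documents it links to."""
    targets = defaultdict(set)
    for web_tuple in web_table:
        for src, dst in web_tuple:
            targets[src].add(dst)
    return dict(targets)

def luminosity(web_table):
    """Number of distinct documents linked from each document."""
    return {doc: len(dsts) for doc, dsts in outgoing_links(web_table).items()}

def rule_stats(items_by_source, antecedent, consequent):
    """Standard support and confidence of the rule antecedent => consequent."""
    total = len(items_by_source)
    with_antecedent = [s for s in items_by_source.values() if antecedent <= s]
    with_both = [s for s in with_antecedent if consequent <= s]
    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical web table: five companies linking to the products they make.
web_table = [
    [("c1", "A")], [("c1", "B")], [("c1", "C")],
    [("c2", "A")],
    [("c3", "B")],
    [("c4", "C")],
    [("c5", "B")], [("c5", "C")],
]

products = outgoing_links(web_table)
support, confidence = rule_stats(products, {"A"}, {"B", "C"})
```

In this invented data two of five companies (40%) make product A and one of five (20%) makes A, B and C, so the support of the rule is 0.2 and its confidence is 0.5.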

Luminous Paths: A luminous path in a web table is a set of inter-linked nodes (a path) 
which occurs some number of times across the tuples in the web table. That is, the 
number of occurrences of this set of inter-linked nodes is high compared to the total 
number of web tuples in the web table. An implication is that, in order to couple the 
query results from the WWW, most of the web tuples in the web table have to traverse 
the luminous paths. As an application, luminous paths can be used to optimize the 
visualization of query results. Once the results are returned, one needs to browse the 
nodes (web pages) in the set of luminous paths only once. For example, between two 
web pages there may exist two paths such that one is a subset of the other. 
In that case, the common paths (web pages) need to be browsed only once. 
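
One way to picture the discovery of luminous paths is to count, for every contiguous sub-path, the fraction of web tuples in which it occurs. The sketch below assumes each web tuple is a simple sequence of nodes and uses a hypothetical parameter `min_fraction` as the threshold; it is an illustration only, not the method of [3].

```python
from collections import Counter

def sub_paths(nodes):
    """All contiguous sub-paths of at least two nodes, as a set of tuples."""
    n = len(nodes)
    return {tuple(nodes[i:j]) for i in range(n) for j in range(i + 2, n + 1)}

def luminous_paths(web_table, min_fraction):
    """Sub-paths occurring in at least `min_fraction` of the web tuples."""
    counts = Counter()
    for nodes in web_table:
        counts.update(sub_paths(nodes))  # each tuple counts a path at most once
    total = len(web_table)
    return {path for path, c in counts.items() if c / total >= min_fraction}

# Hypothetical web table: each web tuple is a sequence of linked pages.
web_table = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "b", "c"],
    ["e", "b", "c"],
]

paths = luminous_paths(web_table, min_fraction=0.75)
```

With these four tuples the paths a-b and b-c each occur in three tuples and are reported, while a-b-c occurs in only two and is not.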

Another interesting application is to find whether two given queries are 
similar. Consider two web tables T1 and T2 corresponding to two query graphs Q1 
and Q2. If we find that the sets of luminous paths in the two web tables have 
luminous paths or sub-paths in common, then we can infer that the corresponding query 
graphs are similar. We would also like to find the kind of similarity relationship; that 
is, whether they are conceptually related, whether the keywords present in the two web 
pages are synonyms of each other, or whether they are topically related. 

4 Web Content Mining 

Web content mining involves mining web data contents. In effect web content mining 
is the analog of data mining techniques for relational databases since we can expect to 
find similar types of knowledge from unstructured data residing in web documents. 
The unstructured nature of web data forces a different approach towards web content 
mining. In WHOWEDA, currently we primarily focus on mining useful information 
from the web hypertext data. In particular, we consider the following issues of web 
content mining in the web warehouse context: 

• Similarity and difference between web content mining in web warehouse context 
and conventional data mining. In the case of web data, documents are totally 
unstructured, and different attributes in documents may have semantically similar 
meanings across the WWW or vice versa. For example, one web site may display the 
price of a car in numeric figures while others display it in words. In order to do content 
mining, one must first resolve the problems of semantic integration across web 
documents. 

• Selection and cleaning of the type of data in the WWW on which to do web content 
mining. The user must be provided with the facility to identify a subset of the web 
which pertains to the domain of the knowledge discovery task. Then, depending on the 
specific kind of knowledge to be mined, another level of data selection must be carried 
out to extract relevant data into a suitable representative model. 

• Types of knowledge that can be discovered in a web warehouse context. The types 
of knowledge to be discovered are as follows: generalized relation, characteristic 
rule, discriminant rule, classification rule, association rule, and deviation rule [9]. 

• Discovery of types of information hidden in a web warehouse which are useful for 
decision making. Web data sources, being heterogeneous, diverse and unstructured, 
are difficult to categorize. In many cases, the user is even more unsure about 
the knowledge hidden beneath the contents of a document than about that in a database. 
An interactive and iterative process is therefore necessary to enable exploratory 
data mining. A suitable data mining query language is one of the means to 
materialize such a user-mediated process. 

• To perform interactive web content mining. A graphical user interface is helpful for 
interactive mining of multiple-level rules because it facilitates interactive 
modification of the threshold values, warehouse concept mart (discussed later), 
concept levels, output styles and formats. 

5 Web Usage Mining 

Web usage mining [2] is the discovery of user access patterns from web server logs, 
which maintain an account of each user's browsing activities. Web servers automatically 
generate large volumes of data, stored on the server in files referred to as logs, which 
contain information about user profiles, access patterns for pages, etc. This information 
can be used for efficient and effective web site management and for understanding user 
behavior. Apart from finding paths traversed frequently by users as a series of URLs, 
associations indicating which sites are likely to be visited together can also be derived. 

In WHOWEDA, the user initiates a coupling framework to collect related 
information. For example, a user may be interested in coupling a query graph “to find 
the hotel information” with the query graph “to find the places of interest”. From such 
coupled query graphs, we can derive user access patterns of the coupling framework. We 
can generate a rule like “50% of users who query ‘hotel’ also couple their query with 
‘places of interest’”. This information can be used in the warehouse for local coupling, 
i.e., coupling of materialized web tables containing information on hotels with those on 
places of interest. Another kind of information that can be of interest is the set of 
coupled concepts found from the coupling framework. This can be used in organizing web 
sites. For example, web documents that provide information on “hotels” should also have 
hyperlinks to web pages providing information on “places of interest”. These coupled 
concepts can also be used to design the Warehouse Concept Mart (WCM), discussed in the 
next section. 
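
A coupling rule of this kind amounts to a conditional frequency over recorded user sessions. The following sketch assumes, hypothetically, that each session is logged as the set of concepts whose query graphs the user coupled; the log format and the function name are illustrative and not taken from WHOWEDA.

```python
def coupling_confidence(sessions, first, second):
    """Fraction of sessions querying `first` that also couple it with `second`."""
    with_first = [s for s in sessions if first in s]
    if not with_first:
        return 0.0
    return sum(second in s for s in with_first) / len(with_first)

# Hypothetical session log: one set of coupled query concepts per user session.
sessions = [
    {"hotel", "places of interest"},
    {"hotel"},
    {"hotel", "places of interest", "flights"},
    {"hotel", "flights"},
    {"flights"},
]

rate = coupling_confidence(sessions, "hotel", "places of interest")
```

In this invented log four sessions query "hotel" and two of them also couple "places of interest", which gives the 50% figure used in the example rule.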

6 Warehouse Concept Mart 

Knowledge discovery in web data becomes more and more complex due to the large 
amount of data on the WWW. We are building concept hierarchies involving web data 
to use them in knowledge discovery. We call such a collection of concept hierarchies a 
Warehouse Concept Mart (WCM). The concept mart is built by extracting and 
generalizing terms from web documents to represent the classification knowledge of a 
given class hierarchy. Unclassified words can be clustered based on their 
common properties. Once the clusters are decided, the keywords can be labeled with 
their corresponding clusters, and the common features of the terms are summarized to 
form the concept description. We can associate a weight with each level of a concept 
mart to evaluate the importance of a term with respect to the concept level in the 
concept hierarchy. The concept marts can be used for the following: 

Web Data Mining and Concept Mart 

The Warehouse Concept Mart (WCM) can be used for web data or content mining. In web 
content mining, we make use of the warehouse concept mart in generating some of the 
useful knowledge. We are using association rule mining techniques to mine the 
associations between words appearing in the concept mart at various levels and in the 
web tuples returned as the result of a query. Mining knowledge at multiple levels may 
help WWW users to find some interesting rules that are difficult to discover 
otherwise. A knowledge discovery process may climb up and step down to different 
concept levels in the warehouse concept mart, guided by the user's interactions and 
instructions, including different threshold values. 

7 Conclusions 

In this paper, we have discussed some web data mining research issues in the context of 
the web warehousing project called WHOWEDA (Warehouse of Web Data). We have 
defined three types of web data mining. In particular, we discussed web data mining 
with respect to web structure, web content and web usage. An important part of our 
warehousing project is to design the tools and techniques for web data mining to 
generate some useful knowledge from the WWW data. Currently we are exploring the 
ideas discussed in this paper. 

Abstract. Since KDD first appeared, research has mainly focused 
on the development of efficient algorithms to extract hidden knowledge. 
As a result, many systems have been implemented during the last 
decade. A common feature of these systems is that they either implement 
a specific algorithm or are specific to a certain domain. As new 
algorithms are designed, existing systems have to be adapted, which means 
both redesigning and recompiling. Consequently, there is an urgent need 
to design and implement systems in which adding new algorithms or 
enhancing existing ones does not require recompiling and/or redesigning 
the whole system. In this paper we present the design and implementation 
of DAMISYS (DAta Mining SYStem). The innovative factor of 
DAMISYS is that it is an engine of KDD algorithms, which means that it 
is able to run different algorithms that are loaded dynamically at run 
time. Another important feature of the system is that it can 
interact with any Data Warehouse, thanks to the connection subsystem 
that has been added. 



1 DAMISYS 

The lack of easily extensible systems integrated with Data Warehouses 
motivated our research, in which the main goal was to design a system that had the 
features of Extensibility, Code reusability, GUI independence, DBMS 
independence, Data base integration and Optimization support. 

In this context, the term extensibility means the capability to add, delete 
and/or update the set of algorithms the system can execute. DAMISYS is a 
system in which adding new algorithms does not involve either redesigning or 
recompiling the system. 
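
The idea of adding and removing algorithms without recompiling can be illustrated with a run-time registry. This is only a sketch of the general technique, using Python's `importlib`; the class `AlgorithmRegistry` and the "module:callable" naming convention are assumptions made here, and functions from the standard `statistics` module stand in for data mining algorithms.

```python
import importlib

class AlgorithmRegistry:
    """Maps algorithm names to 'module:callable' targets resolved at run time."""

    def __init__(self):
        self._entries = {}

    def register(self, name, target):
        self._entries[name] = target

    def unregister(self, name):
        del self._entries[name]

    def names(self):
        return sorted(self._entries)

    def run(self, name, *args, **kwargs):
        module_name, _, attribute = self._entries[name].partition(":")
        module = importlib.import_module(module_name)  # loaded only when needed
        return getattr(module, attribute)(*args, **kwargs)

registry = AlgorithmRegistry()
registry.register("mean", "statistics:mean")
registry.register("median", "statistics:median")

average = registry.run("mean", [1, 2, 3])
registry.unregister("median")  # the set of algorithms changes without a rebuild
```

The caller only ever refers to algorithms by name, so registering a new target changes what the system can execute without touching the code that invokes it.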

Studying data mining algorithms in detail [4], it is straightforward to see that 
they share some functions. Dividing algorithms into basic operations makes it 
possible to interchange operations among different algorithms. This allows us to 
provide code reusability. Another goal DAMISYS achieves is GUI independence. 
This means that functions like user query requests, administrative tasks 
and system monitoring are controlled by different applications using the same 
communication protocol. 

DBMS independence allows DAMISYS to use multiple data repository 
architectures. From now on, the term data repository will be used to name the 
element that holds and manages (data storage, recovery, update and query) the 
data we want to analyze. 

Data repository services give DAMISYS the capability to use these systems 
to store final or intermediate results, permanently or temporarily. There is also other 
useful information, like different preprocessing results from the same original 
data, that can be stored to reduce system response time and to raise system 
performance. We have called this use of data repositories data base integration. 
DAMISYS also implements a series of mechanisms to support future optimization 
policies. Some of these mechanisms are algorithm division into basic operations, 
intermediate result management and parallel algorithm execution, to name a 
few [2]. 

2 Architecture 

The DAMISYS architecture has two levels of division. The first level defines a number 
of subsystems. Each of these subsystems is subdivided, at a second level, into 
different modules. We call a subsystem each component of our design 
that is executed in parallel with other subsystems and performs some general 
system functions. Any of these subsystems could be run concurrently on multiple 
processors in a parallel shared-memory computer. 




Fig. 1. User Connection Subsystem 



On the other hand, modules perform specific operations. The aim of this 
group of specific operations is to carry out the general functions provided by the 
subsystem to other subsystems or to any external program. The difference 
between the subsystem and module concepts is that the former must be executed 
in parallel with other subsystems and performs general functions, whereas modules 
are not necessarily concurrent components and deal with specific operations. 
Each module belongs to a unique subsystem and provides specific 
mechanisms to achieve the final subsystem tasks. The proposed architecture divides 
the DAMISYS system into four subsystems: User Connection Subsystem, 
Execution Plan Constructor, Engine Subsystem and Data Warehouse Access 
Subsystem. When a new user query is received by the User Connection Subsystem, 
it is translated into an internal DAMISYS format (Internal Representation). 
The Execution Plan Constructor processes the query and defines how to solve 
it by means of a structure called an Execution Plan. Execution Plans describe 
which algorithms will be used to solve the query and the values of the algorithm 
parameters. The Engine Subsystem takes an Execution Plan and executes it. 
The tasks described in an Execution Plan are divided into a series of specific 
transformations and functions that are called Basic Operations. Finally, any 
of the subsystems may require data from the Data Warehouse supporting the 
system (in order to execute algorithms). This service is provided by the Data 
Warehouse Access Subsystem, which makes it possible to connect DAMISYS to 
any Data Warehouse system. 



2.1 User Connection Subsystem 

This subsystem provides communication services to external GUI applications and 
transforms the messages sent by these applications into an Internal Representation. 
It also provides user validation and role checking each time a user 
connection is established. User interfaces do not need to be executed on the same 
machine where DAMISYS is running, as its communication interface is able 
to provide remote request submission as well as concurrent user interface 
connections. The User Connection Subsystem has been divided into three modules (see 
Figure 1(a)): 

The User Interface Communication Module controls the information exchanged 
between DAMISYS and remote GUIs. It provides abstract interface functions 
that hide protocol-dependent implementations. The Query Parser, in turn, 
analyses and checks the lexical and syntactic construction of a sentence and translates 
it into a DAMISYS format. Semantic checking is performed by another module. 



2.2 Execution Plan Constructor 

This subsystem creates Execution Plans from the Internal Representation 
of user sentences using a high-level description of the algorithm. The Execution Plan 
Constructor has been structured in three modules (see Figure 1(b)): 

The service offered by this subsystem starts when an Internal Representation 
from the User Connection Subsystem is submitted to the Query Analyzer 
Module. In the case of an administration command, it is sent to the Administration 
Engine; otherwise it is compiled by the Algorithm Compiler Module, which 
uses a high-level description of the appropriate algorithm and sets the values of the 
algorithm parameters. Finally, this module submits an Execution Plan to 
the Engine Subsystem and requests its execution service¹. 

2.3 Engine Subsystem 

This is the most important component of the DAMISYS architecture because this 
subsystem deals with the resolution of Execution Plans. This function is achieved 
by translating them into a chain of transformations that are implemented in 
components that are loaded dynamically. The executions of these components are called 
instances. This subsystem is composed of the following modules: Virtual 
Machine, Dynamic Loader and Working Area. 

Once the Execution Plan is obtained, the engine performs a series of steps 
in order to get a chain of Basic Operations ready to run inside 
the Virtual Machine module of the engine. In this module, Execution Plans 
are read and interpreted to obtain the group of Basic Operations which are 
needed to execute the algorithm. This process requires the following steps: Execution 
Plan interpretation; construction of the Basic Operations chain; execution 
of the chain; and returning of results. The Working Area contains the internal 
data components and their manager, as well as different system resources. In 
order to manage the amounts of data used and created in the Engine 
Subsystem, some structures are required. These structures are called Internal 
Tables, and represent data base tables, which are read and written by the 
Basic Operations chains. The Internal Tables are stored in the Working Area. 
The main feature of Internal Tables, from the point of view of memory 
usage, is their division into pages. The Dynamic Loader module offers instances of the 
Basic Operations of algorithms. The basic components of an algorithm are loaded 
when the system needs to execute it. The Loader module has a Basic Components 
Cache, where it keeps the components currently loaded into the system. 
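
The four steps (plan interpretation, chain construction, chain execution, result return) and the role of the component cache can be sketched as follows. Everything here is a hypothetical stand-in: the plan format, the two basic operations, and a list of dictionaries playing the part of an Internal Table are assumptions made for illustration, not DAMISYS data structures.

```python
def select_rows(rows, column, minimum):
    """Basic operation: keep rows whose `column` value is at least `minimum`."""
    return [row for row in rows if row[column] >= minimum]

def project(rows, columns):
    """Basic operation: keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]

class DynamicLoader:
    """Hands out basic operations by name and caches those already loaded."""

    def __init__(self, catalogue):
        self._catalogue = catalogue
        self.cache = {}
        self.loads = 0

    def get(self, name):
        if name not in self.cache:
            self.cache[name] = self._catalogue[name]  # stands in for a real load
            self.loads += 1
        return self.cache[name]

def execute_plan(plan, table, loader):
    """Interpret the plan, build the chain of operations, run it, return the result."""
    chain = [(loader.get(step["op"]), step["params"]) for step in plan]
    for operation, params in chain:
        table = operation(table, **params)
    return table

loader = DynamicLoader({"select_rows": select_rows, "project": project})
plan = [
    {"op": "select_rows", "params": {"column": "amount", "minimum": 10}},
    {"op": "project", "params": {"columns": ["item"]}},
]
rows = [{"item": "a", "amount": 5}, {"item": "b", "amount": 12}, {"item": "c", "amount": 30}]
result = execute_plan(plan, rows, loader)
```

Running the same plan a second time leaves `loader.loads` unchanged, which is the behaviour one would expect from a cache of already loaded components.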

2.4 Data Warehouse Access Subsystem 

This subsystem provides a common communication method between DAMISYS 
and any Data Warehouse system. It is divided into four main modules: 
Query Submission Module, Query Result Receiver, Query Result 
Repository and Connection Manager. The Query Submission Module receives 
the requests and processes them before sending commands to the Data Warehouse 
system. Queries are temporarily stored in the Query Result Repository. 
The Query Submission Module runs without interruption and does not wait for 
query results. As a consequence, multiple queries can be solved in parallel. 

When an answer is received from the Data Warehouse, the Query Result 
Reception Module matches this answer with the request stored in the Query Result 
Repository and submits the message to the requesting subsystem. 
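
The non-blocking submission and the later matching of answers to requests can be sketched with a table of pending requests. The class and method names below are hypothetical, and the warehouse is simulated by a `send` callback; the sketch shows only the bookkeeping, not any real Data Warehouse protocol.

```python
import itertools

class WarehouseAccess:
    """Submits queries without waiting and routes each answer to its requester."""

    def __init__(self, send):
        self._send = send                 # callback that ships a query to the warehouse
        self._pending = {}                # request id -> requester callback
        self._ids = itertools.count(1)

    def submit(self, requester, query):
        request_id = next(self._ids)
        self._pending[request_id] = requester
        self._send(request_id, query)     # returns immediately, no waiting for rows
        return request_id

    def on_answer(self, request_id, rows):
        requester = self._pending.pop(request_id)  # match answer with stored request
        requester(rows)

    def pending_count(self):
        return len(self._pending)

sent = []
delivered = {}
access = WarehouseAccess(send=lambda request_id, query: sent.append((request_id, query)))

first = access.submit(lambda rows: delivered.update(engine=rows), "SELECT 1")
second = access.submit(lambda rows: delivered.update(planner=rows), "SELECT 2")

# Answers may arrive in any order; each one still reaches the right requester.
access.on_answer(second, ["row-2"])
access.on_answer(first, ["row-1"])
```

Since `submit` never waits, both queries are outstanding at the same time, which mirrors the claim that multiple queries can be solved in parallel.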

¹ In order to apply a specific algorithm this module has to compile its high-level 
description. This algorithm description is defined using the DAMISYS/ALG language. 
DAMISYS/ALG grammar has a like syntax with some simplifications. The detailed 
syntax of this language is too broad a topic to be described completely in this paper. 

Finally, the Connection Manager provides a series of functionalities that allow 
the other modules to access the Data Warehouse services. This module 
implements the abstract interface between the DAMISYS system and the data source. 
This function avoids direct interaction between the rest of the modules of this 
subsystem and the specific protocol required for a particular Data Warehouse 
architecture in a concrete configuration². 

3 Conclusions and Future Work 

All the objectives proposed in Section 1 have been achieved. 

Although optimization mechanisms are implemented, only some 
naive optimization policies have been developed. Our research is now focused on 
providing more complex and useful policies that may enhance DAMISYS system 
performance. The addition of new policies does not require a new design of any of the 
subsystems, because new policies only need subsystem mechanisms to perform 
their action, and these mechanisms are already available. 

The TCP/IP protocol may be translated into CORBA communication. This change 
could be performed to interconnect the DAMISYS system with GUI applications and 
the Data Warehousing system, as well as to distribute DAMISYS subsystems among 
different computers. 

Abstract. Data mining can be used to extensively automate the data analysis 
process. Techniques for mining interval time series, however, have not been 
considered. Such time series are common in many applications. In this paper, 
we investigate mining techniques for such time series. Specifically, we propose 
a technique to discover temporal containment relationships. An item A is said 
to contain an item B if an event of type B occurs during the time span of an 
event of type A, and this is a frequent relationship in the data set. Mining such 
relationships allows the user to gain insight on the temporal relationships 
among various items. We implement the technique and analyze trace data 
collected from a real database application. Experimental results indicate that 
the proposed mining technique can discover interesting results. We also 
introduce a quantization technique as a preprocessing step to generalize the 
method to all time series. 



1 Introduction 

Numerous data mining techniques have been developed for conventional time 
series (e.g., [1], [13], [3], [10], [14]). In general, a time series is a sequence of values 
of a given variable ordered by time. Existing mining techniques treat these values as 
discrete events. That is, events are considered to happen instantaneously at one point 
in time, e.g., the speed is 15 miles/hour at time t. In this paper, we consider an event 
as being “active” for a period of time. For many applications, events are better treated 
as intervals rather than time points [5]. As an example, let us consider a database 
application, in which a data item is locked and then unlocked sometime later. Instead 
of treating the lock and unlock operations as two discrete events, it can be 
advantageous to interpret them together as a single interval event that better captures 
the nature of the lock. When there are several such events, an interval time series is 
formed. An example is given in Figure 1; interval event B begins and ends during the 
time that interval event A is occurring. Furthermore, interval event E happens during 
the time that interval event B happens (is active). The relationship is described as A 
contains B and B contains E. Formally, let BeginTime(X) and EndTime(X) denote the 
start time and end time of an event X, respectively. Event X is said to contain event Y 
if BeginTime(X) < BeginTime(Y) and EndTime(X) > EndTime(Y). We note that the 
containment relationship is transitive. Thus, A also contains E in this example (but 
this and several edges are not shown to avoid clutter). 
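
The containment definition translates directly into a predicate on intervals. The sketch below uses a hypothetical `Interval` record with invented begin and end times for events A, B and E; only the strict inequalities come from the definition above.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    kind: str      # event type, e.g. "A"
    begin: float   # BeginTime
    end: float     # EndTime

def contains(x: Interval, y: Interval) -> bool:
    """X contains Y iff BeginTime(X) < BeginTime(Y) and EndTime(X) > EndTime(Y)."""
    return x.begin < y.begin and x.end > y.end

# Illustrative times only; they are not taken from Figure 1.
a = Interval("A", 0, 10)
b = Interval("B", 2, 8)
e = Interval("E", 3, 5)

direct = contains(a, b) and contains(b, e)
transitive = contains(a, e)   # follows from the two containments above
```

Because both inequalities are strict, an event never contains itself, and two events that start or end at the same instant do not contain one another.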

Mukesh Mohania and A Min Tjoa (Eds.): DaWaK’99, LNCS 1676, pp. 318-330, 1999 
© Springer-Verlag Berlin Heidelberg 1999 

The problems of data mining association rules, sequential patterns and time series 
have received much attention lately as Data Warehousing and OLAP (On-line 
Analytical Processing) techniques mature. Data mining techniques facilitate a more 
automated search of knowledge from large data stores which exist and are being built 
by many organizations. Association rule mining [2] is perhaps the most researched 
problem of the three. Extensions to the problem include the inclusion of the effect of 
time on association rules [6] [11] and the use of continuous numeric and categorical 
attributes [12]. Mining sequential patterns is explored in [4]. Therein, a pattern is a 
sequence of events attributed to an entity, such as items purchased by a customer. 
Like association rule mining, [4] reduces the search space by using knowledge from 
size k patterns when looking for size k+1 patterns. However, as will be explained 
later, this optimization cannot be used for mining interval time series. In [9], there is 
no subgrouping of items in a sequence; a sequence is simply a long list of events. To 
limit the size of mined events and the algorithm runtime, a time window width is 
specified so that only events that occur within time w of each other are detected. 
Unlike [4], the fact that sub-events of a given-event are frequent cannot be used for 
optimization purposes. 

The name interval event sequence does not imply that the interval events happen 
sequentially, as we have seen that intervals may overlap. A partial order can be 
imposed on the events to transform the sequence into a graph. Let this relation be 
called the containment relation. Applying this relation to the above example yields 
the graph in Figure 2. This graph represents the containment relationship between the 
events. A directed edge from event A to event B denotes the fact that A contains B. 
We note that a longer event sequence would normally consist of several directed 
graphs as illustrated in Figure 2. Furthermore, events can repeat in a sequence. For 
instance, events of type A occur twice in Figure 2. Each event is a unique instance, 
but the nodes are labeled according to the type of event. 




Fig. 1. Interval Events 



Fig. 2. Containment Graph 



Data mining can be performed on the interval sequence by gathering information 
about how frequently such containments happen. Given two event types S and D, all 
edges S->D in the containment graph represent instances of the same containment 
relationship. Therefore, associated with each containment relationship is a count of 
its instances. For example, the count for A contains B is 2 in Figure 2. Given a 
threshold, a mining algorithm will search for all containments, including the transitive 
ones, with a count that meets or exceeds that threshold. These mined containments 
can shed light on the behavior of the entity represented by the interval time series. 
The proposed technique can have many applications. We will discuss some in section 
2. 
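
Counting containment instances per pair of event types and applying the threshold can be sketched with a brute-force comparison of all event pairs. This is an illustration of the counting idea only, not the mining algorithm proposed in this paper; the event times are invented, and events are assumed to be (type, begin, end) triples.

```python
from collections import Counter

def mine_containments(events, threshold):
    """Count containment instances per (container type, contained type) pair.

    Every pair of events is compared directly, so transitive containments
    are counted as well. Pairs whose count is below `threshold` are dropped.
    """
    counts = Counter()
    for kind_x, begin_x, end_x in events:
        for kind_y, begin_y, end_y in events:
            if begin_x < begin_y and end_x > end_y:
                counts[(kind_x, kind_y)] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}

# Illustrative events: two instances of A, each containing one instance of B.
events = [
    ("A", 0, 10), ("B", 2, 8), ("E", 3, 5),
    ("A", 20, 30), ("B", 22, 25),
]

frequent = mine_containments(events, threshold=2)
everything = mine_containments(events, threshold=1)
```

With a threshold of 2 only "A contains B" survives; lowering the threshold to 1 also reports "B contains E" and the transitive "A contains E". The quadratic comparison is chosen for clarity and would be costly on long series.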

A related work was presented in [7]. Therein, a rule discovery technique for time 
series was introduced. This scheme finds rules relating patterns in a time series to 
other patterns in that same or another series. As an example, the algorithm can 
uncover a rule such as “a period of low telephone call activity is usually followed by a 
sharp rise in call volume.” In general, the rule format is as follows: 

If A1 and A2 and ... and Ah occur within V units of time, then B occurs 
within time T. 

This rule format is different from the containment relationship defined in the 
current paper. The mining strategies are also different. The technique in [7] uses a 
sliding window to limit the comparisons to only the patterns within the window at any 
one time. This approach significantly reduces the complexity. However, choosing an 
appropriate size for the window can be a difficult task. As we will discuss later, our 
technique does not have this problem. 

The remainder of this paper is organized as follows. Section 2 covers some 
applications where this technique is useful. Algorithms, functions, measures and 
other items related to the mining process are discussed in section 3. Experimental 
studies are covered in section 4. Finally, we provide our concluding remarks in 
section 5. 



2 Applications 

Several applications exist where mining containment relationships can provide 
insight about the operation of the system in question. A database log file can be used 
as input to the mining algorithm to discover what events happen within the duration 
of other events; resource, record, and other locking behavior can be mined from the 
log file. Some of this behavior is probably obvious since it can be deduced by 
looking at query and program source code. Other behavior may be unexpected and 
difficult to detect or find because it cannot be deduced easily, as is the case for large 
distributed and/or concurrent database systems. 

Another application area is mining system performance data. For example, a file 
open / file close event can contain several operations performed during the time that 
the file is open. Some of these operations may affect the file, while other operations 
are not directly associated with the file but can be shown to occur only during those 
times during which the file is open. Other interesting facts relating performance of the CPU 
to disk performance, for example, can be studied. Although performance data is not 
usually in interval event format, it can be converted to that format by using 
quantization methods. 

In the medical field, containment relationship data can be mined from medical 
records to study what symptoms surround the duration of a disease, what diseases 
surround the duration of other diseases, and what symptoms arise during the time of a 
disease. For example, one may find that during a flu infection, a certain strain of 
bacteria is found on the patient, and that this relationship arises often. Another 
discovery might be that during the presence of those bacteria, the patient’s fever 
briefly surpasses 107 degrees Fahrenheit. 

Factory behavior can also be mined by looking at sensor and similar data. The 
time during which a sensor is active (or above a certain threshold) can be considered 
an interval event. Any other sensors active during/within that time window are then 
considered to have a containment relationship with the first sensor. For example, it is 
possible to detect that the time interval during which a pressure relief valve is 
activated always happens within the time interval in which a new part is being moved 
by a specific conveyor belt. 



3 Mining Interval Time Series 



3.1 From Series of Interval Events to Containment Graph 

Both of the containment graphs shown in the introduction are minimally connected 
for simplicity of illustration. However, the algorithms and measures described in this 
paper use a transitively closed version of a containment graph. A straightforward 
algorithm converts an interval event series into this kind of graph. It takes a list of 
event endpoints, sorted by time stamp, of the form 

<time_stamp, event_id, end_point in {begin, end}, event_type> 
where each interval event has two such tuples: one for the beginning time and one 
for the ending time. With the input in this format, the entire graph can be loaded 
and built in one pass through the input data, and searching the graph for the 
location of a containment (as each new containment is added to it) becomes 
unnecessary. The output is a directed containment graph G=(V,E), where each node 
in V corresponds to an individual interval event and is of the form 
<event_id, event_type, begin_time_stamp, end_time_stamp> 
and each directed edge in E from a node Vi to a node Vk exists because interval event 
Vi contains interval event Vk. The constructed graph is transitively closed in order to 
reduce the complexity of the mining algorithms. 
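
The one-pass construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_containment_graph`, the `Node` tuple, and the rule that ties in time stamp are processed in input order are all assumptions made for the example. An event is treated as contained by every event that was already open when it began and is still open when it ends, which yields the transitively closed edge set directly.

```python
from collections import namedtuple

# Node layout follows the <event_id, event_type, begin_time_stamp, end_time_stamp>
# form given in the text; the field names are illustrative.
Node = namedtuple("Node", "event_id event_type begin end")


def build_containment_graph(endpoints):
    """Build (nodes, edges) from endpoint tuples sorted by time stamp.

    endpoints: iterable of (time_stamp, event_id, end_point, event_type),
    where end_point is "begin" or "end".
    Returns a dict of nodes and a set of (container_id, contained_id) edges.
    """
    active = {}      # event_id -> begin time of intervals that are still open
    candidates = {}  # event_id -> ids of intervals open when this one began
    nodes = {}
    edges = set()
    for ts, eid, end_point, etype in endpoints:
        if end_point == "begin":
            candidates[eid] = set(active)
            active[eid] = ts
        else:
            begin_ts = active.pop(eid)
            nodes[eid] = Node(eid, etype, begin_ts, ts)
            for cid in candidates.pop(eid):
                # cid opened earlier and is still open, so it spans all of eid
                if cid in active:
                    edges.add((cid, eid))
    return nodes, edges


example = [
    (0, 1, "begin", "A"), (1, 2, "begin", "B"), (2, 3, "begin", "C"),
    (3, 3, "end", "C"), (4, 4, "begin", "D"), (5, 2, "end", "B"),
    (10, 1, "end", "A"), (12, 4, "end", "D"),
]
nodes, edges = build_containment_graph(example)
```

In this example event 4 overlaps events 1 and 2 but outlasts both, so it receives no incoming edge, while event 3 is linked to both of its containers.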



3.2 Quantization 

It might be desirable to apply interval event mining to a dataset that is not in 
interval event form. Continuously varying data is not fit for mining because of the 
potentially infinite number of different values that a parameter can assume. In such 
cases, there might not be any repetition of containments, rendering the mining 
algorithm useless. By setting thresholds and/or discretizing, quantitative performance 
data can be classified into bins, and these bins can be considered intervals (that is, an 
interval event occurs during the time that the given parameter is within the specified 
bin value range). 

Suppose we have a day’s worth of log data for CPU, disk and network interface 
usage. By carefully selecting predicates, such as C1: 0<=CPU.busy<30%, 
C2: 30%<=CPU.busy<70%, C3: CPU.busy>=70%, D1: disk.busy<40%, 
D2: disk.busy>=40%, N1: network.busy<75%, and N2: network.busy>=75%, 
continuously varying performance data can be transformed into these discrete bin 
values according to which predicate is satisfied by a measurement point. 
Furthermore, whenever two or more occurrences of the same predicate are contiguous, the time 
during which this happens can be interpreted as an interval of type X, where X is in 
{C1, C2, C3, D1, D2, N1, N2}. Using these “bin-events”, containments such as 
“when network usage is at or above 75%, disk usage is at or above 40%, and when 
such disk usage is observed, CPU usage during that time dips below 30%, and this 
behavior was observed with P% support” can be discovered. 
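
As a rough illustration of how measurement points become bin-events, the sketch below coalesces consecutive samples that satisfy the same predicate into one interval. The helper names `bin_label` and `to_interval_events` are hypothetical, and closing each interval at the time stamp of its last sample (rather than at the next sample) is an assumption of this example.

```python
def bin_label(value, bins):
    """Return the label of the first bin whose range [low, high) holds value."""
    for label, low, high in bins:
        if low <= value < high:
            return label
    raise ValueError("value %r falls outside every bin" % (value,))


def to_interval_events(samples, bins):
    """Turn time-sorted (time_stamp, value) samples into (label, begin, end) events.

    Consecutive samples with the same bin label are merged into one interval.
    """
    events = []
    current = None  # [label, begin, end] of the interval being extended
    for ts, value in samples:
        label = bin_label(value, bins)
        if current is not None and current[0] == label:
            current[2] = ts
        else:
            if current is not None:
                events.append(tuple(current))
            current = [label, ts, ts]
    if current is not None:
        events.append(tuple(current))
    return events


# CPU predicates C1, C2 and C3 from the text, with CPU.busy in percent.
CPU_BINS = [("C1", 0.0, 30.0), ("C2", 30.0, 70.0), ("C3", 70.0, float("inf"))]
cpu_samples = [(0, 10.0), (5, 20.0), (10, 50.0), (15, 80.0), (20, 90.0)]
cpu_events = to_interval_events(cpu_samples, CPU_BINS)
```

Running the same function over the disk and network series with their own bins gives interval events that can be sorted by endpoint and fed to the graph construction of Section 3.1.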

Quantization can be done in several ways, and many methods have been 
researched in various areas both within and outside computer science. Some 
important considerations include determining how many discrete values the data 
should be pigeonholed into, the number of observations that should fall into each 
discrete value, and the range of continuous values that each discrete value should 
represent. To achieve some kind of grouping, clustering methods can be used along a 
parameter’s range of observations, thereby coalescing similar values. This, of course, 
assumes that such groups exist in the data. The output of the regression tree methods 
in [8] can be used to segment continuous values into meaningful subgroups. The 
numeric ranges chosen for attributes in output from using [12] can also be utilized for 
segmentation. In the absence of such patterns, another method is to statistically 
separate the continuous data by using standard deviation and average metrics. This is 
the approach used in this paper for transforming the Oracle performance data. 
Another method is to select equally sized ranges, without guaranteeing that each 
range will have an equal number of or a significant number of observations. In 
contrast, the observations could be sorted and then divided up into bins of equal size, 
without regard to the significance of the numeric attribute. The choice of which 
quantization method to use is heavily dependent on the domain that the data is coming 
from. 



3.3 Containment Frequency and Support 

In the field of data mining, a recurrent theme is that of constraint measures that the 
user specifies, which any piece of knowledge extracted must satisfy. Support, 
confidence, and interestingness are some of the most common. In interval time series 
mining, several functions can be used for selecting useful knowledge. Each of these 
measures will be referred to as a counting predicate. The usefulness and 
interestingness of the mined containments depend on which counting predicate is 
chosen. A factor driving the selection is the domain of data being mined, and 
consequently the form that the interval event data takes. 

Some of the most straightforward counting predicates involve measures of items in 
the containment graph. The most obvious counting predicate is the number of times 
that a given containment appears. For this measure, the containment frequency 
measures the number of times a containment appears in the graph, where each 
containment counted does not share any nodes (interval events) with any other 
instance of that containment. Multipath containment frequency relaxes this 
requirement, and thus counts the number of times a containment exists, given all 
possible paths that exist in the graph. Node frequency is the number of distinct nodes 
in the set of all nodes combined from all the paths for a given 
containment. Similarly, edge frequency is the number of distinct edges (size-two 
containments) in the set of all edges combined from all the paths for a 
given containment. Multipath node frequency and multipath edge frequency relax the 
distinctness requirement in a fashion similar to the difference between containment 
frequency and multipath containment frequency, so a node/edge can be counted 
multiple times. Examples of these counting predicates follow. 

3.3.1 Counting Predicates and Containments Enumeration 

Define a containment (or path) as a tuple of the form CC = <n1, n2, ..., nk>, 
where each n(i) is an interval event, and is labeled by its interval event type ID. Each 
n(i+1) is contained by n(i) for all 1<=i<=k-1 and there exists a directed edge E(n(i), 
n(i+1)) in the containment graph for all such i. As discussed, containment frequency 
can be measured in different ways. In addition, because the graph is a lattice, an 
internal node can have several parent nodes. This property translates into entire 
subpaths that can be shared by several nodes. When counting the frequency of a path, 
should nodes be allowed to appear in more than one path? For example, in the 
containment graph in Figure 3, how often does containment <A, B, X, Y, Z> occur? If 
nodes can appear on more than one path, then the counting predicate is called 
multipath containment frequency and the frequency of containment <A, B, X, Y, Z> is 
2. If the nodes on a path cannot appear on more than one path, then the counting 
predicate is called containment frequency and the result is 1. The edge frequency in 
this example is 6 and node frequency is 7. The relationships between containment 
frequency, edge frequency, node frequency, and the multipath variations of these 
counting predicates will vary according to the shape of the containment graph, which 
in turn is determined by how interval events contain each other. Table 1 shows the 
counting predicates and values corresponding to each predicate. 

Fig. 3. Shared subcontainments 



Table 1. Counting predicates for Figure 3 

Counting Predicate                Value 
containment frequency             1 
multipath containment frequency   2 
edge frequency                    6 
multipath edge frequency          8 
node frequency                    7 
multipath node frequency          10 



In determining the support of a containment, to maintain consistency with other 
data mining methods, support is defined as a percentage that relates the 
frequency measurement to the maximum frequency possible for the problem 
instance. Table 2 shows, for each counting predicate, the quantity against which 
a given frequency of that counting predicate is expressed as a percentage to 
obtain its support. 



Table 2. Support measures 

Counting Predicate                Support 
Containment frequency             percentage of the maximum number of containments 
                                  of that size that can exist 
Multipath containment frequency   percentage of the maximum number of containments 
                                  of that size that can exist, for all possible 
                                  root-to-leaf paths in the graph 
Edge frequency                    percentage of the total number of edges in the graph 
Multipath edge frequency          percentage of the total number of edges in all 
                                  possible root-to-leaf paths of the graph 
Node frequency                    percentage of the total number of nodes in the graph 
Multipath node frequency          percentage of the total number of nodes in all 
                                  possible root-to-leaf paths of the graph 



3.3.2 Multipath Counting Predicates 

When would a multipath counting predicate be favored over its non-shared 
counterpart? A non-shared counting predicate only indicates what percentage of the 
containment graph supports a given containment. It does not readily differentiate 
where there is overlap among instances of a given containment and where there is not. 
For example, in Figure 4, the containment frequency for <B, A, F> is 2 because there 
are at most 2 unique occurrences of this containment given the restrictions of that 
counting predicate. In contrast, the multipath containment frequency is 24 (3*4*2). 
Likewise, the node frequency is 9, and in contrast the multipath node frequency is 72 
(3*24). In certain problem domains, the fact that there is overlap between several 
instances of the same containment is useful information. Suppose that interval event 
B is a disk failure, interval event A is a network adapter failure, and interval event F is 
a network failure. The fact that these events happen at approximately the same time, 
thus causing the amount of overlap seen in the example, has a different meaning than 
if these containments happened at different times. The events probably occur together 
because of a malicious program virus attack that is set off at a specific time of day, for 
example. 



3.4 Mining algorithms 

There are several ways to mine the data to find the frequent containments. The 
naive approach is to traverse the lattice on a depth-first basis and at each point of the 
traversal enumerate and count all paths. Another way is to search for containments 
incrementally by path size; this is the approach used in this paper. A path is described 
by the sequence of node types in the path. Because there is a one-to-many mapping 
from the node types to the transaction ID’s, a path can exist multiple times in the 
entire graph. This graph can be traversed using lattice traversal algorithms, or it can 
be stored in relational database tables and mined using SQL statements. 



3.4.1 Naive Algorithm for Mining Containments 

Perform a depth-first traversal of the lattice whereby all the possible paths through 
the lattice are explored. At each node visit of the traversal, there exists a traversal 
path TP by which this node was reached. This corresponds to the recursive calls that 
the program is following. Path TP is <tp1, tp2, tp3, ..., tpn>, where tp1 is the 
topmost node in the path and tpn is the current node (can be internal or leaf node) 
being visited by the traversal algorithm. By definition, tp1 has no parent and hence, 
there is no interval event which contains tp1. For each subpath (containment) of TP 
of the form TPS in {<tp(n-1), tpn>, <tp(n-2), tp(n-1), tpn>, ..., <tp1, tp2, ..., tp(n-1), 
tpn>}, increment this subpath’s counter in the path list PL which indicates the number 
of times that this path (containment) appears. When the entire lattice has been 
traversed, the paths in PL that satisfy the counting predicates (such as containment 
frequency >= minimum mining containment frequency) are presented to the user. 
This exhaustive counting method will find all possible containments. Herein lies the 
disadvantage: the number of frequent containments will typically be a small (or very 
small) subset of all the possible containments, so this algorithm might not have a 
chance to run to completion because of the large amount of storage required to store 
all the paths. We discuss this algorithm because it helps to illustrate the mining 
technique. 
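
The naive traversal can be sketched in a few lines. This is an illustrative reading of the description above rather than the authors' code: the names `naive_mine`, `children`, and `node_type` are hypothetical, and each subpath is stored as a set of node-id sequences so that an instance reached through several traversal paths is recorded only once. Under that assumption the reported count is the number of distinct node sequences for a type path, which corresponds to the multipath containment frequency.

```python
from collections import defaultdict


def naive_mine(children, node_type, min_freq):
    """Enumerate every containment by depth-first traversal and count it.

    children:  dict node_id -> list of directly reachable contained node ids
    node_type: dict node_id -> interval event type label
    Returns {type path: count} for paths seen at least min_freq times.
    """
    has_parent = {c for kids in children.values() for c in kids}
    roots = [n for n in node_type if n not in has_parent]
    instances = defaultdict(set)  # the path list PL: type path -> node paths

    def visit(path):
        # Record every suffix of the traversal path TP that has two or more nodes.
        for start in range(len(path) - 1):
            sub = tuple(path[start:])
            instances[tuple(node_type[n] for n in sub)].add(sub)
        for child in children.get(path[-1], ()):
            path.append(child)
            visit(path)
            path.pop()

    for root in roots:
        visit([root])
    return {p: len(s) for p, s in instances.items() if len(s) >= min_freq}


# Small transitively closed example: event 1 contains 2..5, 2 contains 3, 4 contains 5.
node_type = {1: "A", 2: "B", 3: "C", 4: "B", 5: "C"}
children = {1: [2, 3, 4, 5], 2: [3], 4: [5]}
frequent = naive_mine(children, node_type, min_freq=2)
```

The `instances` dictionary grows with every distinct path in the lattice, which is the storage problem noted in the text.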

3.4.2 Growing Snake Traversal, Shared Node Multiple Containments 

Unlike several other data mining methods, when finding frequent containments it is 
not always possible to prune the search space by using mining results of previous 
iterations. The property that would allow such pruning, if it held, is that if a 
containment CSUB has frequency CSUB.FREQ for a given counting predicate, then 
any containment CSUPER of which CSUB is a subcontainment satisfies 
CSUPER.FREQ <= CSUB.FREQ. Unfortunately, this property cannot be 
exploited by mining in stages for incrementally larger containments, because several 
of these larger containments can potentially share a smaller containment. Sharing 
leads to violation of this property. Containment <A, B, X, Y, Z> shown in Figure 3 
illustrates this: the containment frequency for <A, B, X> is 1, but the containment 
frequency for <A, B, X, Y, Z> is 2, a higher value. Results are similar for the other 
counting predicates. 

To reduce the amount of storage required for intermediate results, the Growing 
Snake Traversal, as the name implies, starts by mining all size 2 containments. A 
traversal is done as in the naive algorithm, except that only paths of the form 
<tp(n-1), tpn> are enumerated. When all such containments have been found, only those 
that satisfy the selected counting predicates are retained. Multiple counting predicates 
can be mixed in a boolean expression, forming a counting predicate function to be 
satisfied by each mined containment. Allowing this freedom for the user broadens the 
applications of the mining method because the user can decide what counting 
predicates or counting predicate function(s) must be met by a mined containment in 
order for it to be considered useful knowledge. Next, containments of size 3 (having 
form <tp(n-2), tp(n-1), tpn>) are enumerated and the same counting predicate 
function is applied to select useful containments. This is repeated until the maximum 
containment size is reached. Algorithm 1 contains the details. 

Algorithm 1. 

Input: Containment graph CG, containment predicate function CPF 
Output: Set FINAL_CONT of mined containments 

containment_bucket array CA[] (each element containing CASIZE containments) 
containment_bucket FINAL_CONT 
int k = 0 

for containment size CS = 2 to CG.max_containment_size 
    for each containment CCL in CG of size CS 
        put CCL in current bucket CA[k] 
        if CA[k] is full 
            sort CA[k] 
            allocate a new bucket CA[k+1] 
            k = k+1 
        endif 
    endfor 
    merge all CCLs in all CA buckets into the FINAL_CONT bucket, putting in 
        only those that meet the criteria of sufficient frequency, sufficient 
        node frequency, sufficient edge frequency, and/or other counting 
        predicate(s) (an n-way merge is used to merge the buckets, or an 
        iteration of 2-way merges could also be used) 
    delete all containments in CA 
endfor 
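
A compact sketch of the enumerate, sort, and merge-count structure of Algorithm 1 is given below. It is an approximation under stated assumptions: buckets are in-memory lists standing in for runs that a real implementation would spill to secondary storage, the frequency counted is the number of distinct node paths per type path (the multipath containment frequency), and the names `paths_of_size`, `growing_snake`, and the two-argument `cpf(type_path, freq)` signature are hypothetical.

```python
import heapq
from itertools import groupby


def paths_of_size(children, size):
    """Yield every path with exactly `size` nodes as a tuple of node ids."""
    def extend(path):
        if len(path) == size:
            yield tuple(path)
            return
        for child in children.get(path[-1], ()):
            yield from extend(path + [child])

    nodes = set(children) | {c for kids in children.values() for c in kids}
    for node in sorted(nodes):
        yield from extend([node])


def growing_snake(children, node_type, max_size, casize, cpf):
    """Mine containments by increasing size using sorted buckets and a merge-count."""
    final_cont = {}
    for cs in range(2, max_size + 1):
        buckets, current = [], []
        for path in paths_of_size(children, cs):
            current.append(tuple(node_type[n] for n in path))
            if len(current) == casize:      # bucket CA[k] is full
                buckets.append(sorted(current))
                current = []
        if current:                         # sort the last, partly filled bucket
            buckets.append(sorted(current))
        # n-way merge of the sorted buckets; equal type paths come out adjacent
        for type_path, run in groupby(heapq.merge(*buckets)):
            freq = sum(1 for _ in run)
            if cpf(type_path, freq):
                final_cont[type_path] = freq
        # buckets go out of scope here, mirroring "delete all containments in CA"
    return final_cont


node_type = {1: "A", 2: "B", 3: "C", 4: "B", 5: "C"}
children = {1: [2, 3, 4, 5], 2: [3], 4: [5]}
mined = growing_snake(children, node_type, max_size=3, casize=2,
                      cpf=lambda type_path, freq: freq >= 2)
```

Because each bucket is sorted before the merge, counting reduces to measuring run lengths in a single sequential pass, which is what makes the sequential-storage variant discussed after the algorithm possible.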

For each containment size CS, the step of containment enumeration is followed by 
a merge-count because the enumeration has to happen in stages in order to effectively 
use the limited amount of RAM (Random Access Memory) in today’s computers. For 
example, given about 7 hours worth of interval data from discretized performance 
data from a system running an Oracle database application, the memory usage for the 
algorithm can at times exceed 300MB. Randomly accessing such a structure on a 
computer with sufficient disk space to store it but not enough RAM for it all to be 
on-line at once will cause thrashing, rendering the algorithm ineffective. A merge-count 
allows the use of very large datasets. The CASIZE parameter is chosen such that the 
size of each CA[k] is small enough to fit in physical RAM. Although it is not shown, 
our implementation of the algorithm ensures that a containment is not counted twice 
by pruning paths which exist entirely within subsections of the graph which have 
already been visited. For edge frequency and node frequency counting predicates, the 
small number of duplicate edges and nodes that arise during the merge step (as a 
result of paths which are partially in an explored region of the graph) are eliminated 
during the merge phase of the algorithm. 

In our experiments, the entire containment graph was kept on-line. The graph does 
not need to be stored completely on-line, however. A modification to the algorithm 
will permit mining datasets where the containment graph is larger than available 
RAM space by only keeping events in memory that are active during the current 
timestamp. Consequently, the section of the containment graph being mined is built 
dynamically as access to it is required. Our algorithm already resorts to merging for 
generating the mined containments, so a combination of these two techniques yields 
an algorithm that is limited only by available secondary storage. Furthermore, the 
data access and generation pattern (if using multiple 2-way merges) is sequential, so a 
group of devices that support sequential access, such as tape drives, could also be 
used by the algorithm. 



4 Experimental Results 

Experiments were run on a Dell PowerEdge 6300 server with 1GB RAM and dual 
400MHz Pentium processors for the synthetic data, and on a Dell Pentium Pro 
200MHz workstation with 64MB RAM. The first experiment consisted of mining 
containment relations from an artificially generated event list. A Zipf distribution was 
used in selecting the event types and a Poisson arrival rate was used for the 
inter-event times. This smaller list is beneficial in testing the correctness of the 
programmed algorithm because the output can be readily checked for correctness. 

In the second experiment, disk performance data from an Oracle database 
application was converted from quantitative measurements to interval events by 
quantizing the continuous values into discrete values. The disk performance data 
consists of various parameters for several disks, measured at 5-minute time intervals. 
Discrete values were chosen based on an assumed normal distribution for each 
parameter and using that parameter’s statistical z-score. “Low”, “average” and “high” 
were assigned to a value by assigning a z-score range to each discrete value. Values 
used were “low”, corresponding to z-score<-0.5, “average” corresponding to a 
z-score in [-0.5, 0.5], and “high” corresponding to a z-score>0.5. The resulting 
quantized versions of the parameters were close to uniformly distributed in terms of 
the number of occurrences of each range, so this quantization method provided good 
results in this case. 
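
The z-score labeling described above can be written as a short function. This is a sketch only: the name `zscore_labels` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def zscore_labels(values, cut=0.5):
    """Label each value "low", "average" or "high" from its z-score.

    z < -cut -> "low"; -cut <= z <= cut -> "average"; z > cut -> "high".
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    labels = []
    for v in values:
        z = (v - mu) / sigma if sigma > 0 else 0.0
        if z < -cut:
            labels.append("low")
        elif z > cut:
            labels.append("high")
        else:
            labels.append("average")
    return labels


labels = zscore_labels([1, 2, 3, 4, 5])
```

Consecutive measurements with the same label can then be merged into interval events, as described in Section 3.2.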

Some containment results obtained from the output of the sar utility on the Sun 
machine on which the database was running are shown in Table 3. Additionally, 
several containments in which the average service time parameter of disk IDs 40, 18, 20 
and 25 was near its mean value contained several other quantized values of 
parameters of other disks, revealing interesting interactions among several disk 
performance metrics. Table 4 
shows the CPU run times for mining the Oracle dataset. Figure 5 shows the 
relationship between varying Zipf, Poisson arrival times and number of mined 
interval events for the synthetic data set, which consists of 500 events and 8 event 
types. 



Table 3. Some Oracle dataset results 

Param 1              Param 2            Description 
Page faults ‘high’   namei ‘high’       During the time that the number of 
                                        page faults is above average, the 
                                        number of namei function requests is 
                                        also high. This is probably an 
                                        indication that files are being opened 
                                        and accessed, thus increasing the 
                                        RAM file cache size and reducing the 
                                        amount of RAM available to execute 
                                        code. 
‘average’ CPU        vflt ‘low’         During average usage of the CPU by 
usage by system                         the system code, the number of 
                                        address translation page faults was 
                                        below average. This might be an 
                                        indication that much system code is 
                                        non-pageable, so very few page 
                                        faults are generated. 
‘average’ CPU        slock ‘average’    During average usage of the CPU by 
usage by system                         the system, there is an average 
                                        number of lock requests requiring 
                                        physical I/O. 



Table 4. CPU time for execution of mining algorithm vs. number of 
containments mined for Oracle data 

cpu time (sec)   # of events 
40               178 
104              286 
367              335 
543              387 



Fig. 4. Multiple shared subcontainments 

Fig. 5. Synthetic data results 





5 Concluding Remarks 

Numerous data mining techniques have been developed for conventional time 
series. In this paper, we investigated techniques for interval time series. We consider 
an event to be “active” for a period of time, and an interval time series is a sequence 
of such interval events. We pointed out that existing techniques for conventional 
time series and sequential patterns cannot be used. Basically, interval time series are 
mined differently than event series because an event has both a starting and ending 
point, and therefore the containment relationship has different semantics than simply 
happens-before or happens-after. To address this difference, we proposed a new 
mining algorithm for interval time series. 

To assess the effectiveness of our technique, we ran the mining algorithm on 
system performance trace data acquired from an application running on an Oracle 
database. Traditionally, spreadsheet and OLAP (On-line analytical processing) tools 
have been used to visualize performance data. This approach requires the user to be 
an expert and have some knowledge of what to explore. Unsuspected interactions, 
behavior, and anomalies would go undetected. The data mining tools we 
implemented for this study address this problem. Our experimental study indicates 
that they can automatically uncover many interesting results. 

To make the techniques more universal, we proposed a quantization technique 
which transforms conventional time series data into an interval event time series, 
which can then be mined using the proposed method. To illustrate this strategy, we 
discussed its use in a number of applications. 

Abstract. A new modeling technique to mine information from data that are 
expressed in the form of events associated with entities is presented. In particular, 
the technique aims at extracting non-evident behavioral patterns from data in 
order to identify different classes of entities in the considered population. To 
represent the behavior of the entities a Markov chain model is adopted and the 
transition probabilities for such a model are computed. The information 
extracted by means of the proposed technique can be used as decisional support 
in a large range of problems, such as marketing or social behavioral questions. 
A case study concerning the university dropout problem is presented, together 
with further developments of the Markov chain modeling technique intended to 
improve its predictive and/or interpretive power. 



1 Introduction 

This paper presents an approach to the problem of mining information from large data 
sets based on the application of Markov chains. 

Data mining represents the core activity of the so-called Knowledge Discovery in 
Databases (KDD) process, which aims at extracting hidden information from large 
collections of data. Data mining techniques can be divided into five classes of 
methods according to their different goals, that is, the different kinds of knowledge they 
aim to extract [1]. These methods include predictive modeling (e.g. decision trees [2]), 
clustering [3], data summarization (e.g. association rules [4]), dependency modeling 
(e.g. causal modeling [5], [6]) and finally change and deviation detection [7]. 

When time is an important attribute characterizing the available 
information, the data can usually be associated with a time-ordered sequence of 
events. The analysis of such a sequence could then provide knowledge about the 
behavior of the system that, at least ideally, has generated the data. The mined 
knowledge in such cases could subsequently be used to predict, with a sort of “black 
box” pattern matching approach, the evolution of the considered system from the 
observation of its past behavior. To this end, the approach that has been studied in this 
work tries to exploit a model based on the theory of Markov chains in order to 
provide a statistical representation of the properties of the observed system. The data 
organized into a temporal sequence are mined in order to extract the probability of 
transition among the possible states in which the system could evolve. The specific 
framework for which such probabilities should be identified has to be defined a priori, 
taking into account the general characteristics of the considered context of 
application. In other words, when using the proposed approach, the identification of 
the system states should be considered a modeling parameter that clearly can 
influence the effectiveness of the whole mining process. 

The work presented in this paper has been developed with reference to a particular 
case study concerning the problem of university dropouts. A modeling technique 
based on Markov chains to deal with the data about university students has been 
developed in order to identify the population at risk. The paper is organized as follows. 
It begins with a short introduction to Markov chains. Then the proposed 
modeling technique is explained step by step, with the help of an example database. 
Finally the dropout case study is presented with possible improvements to the 
proposed model. 



2 Markov Chains 

2.1 Introduction to Markov Chains 

The theory of Markov chains ([8], [9], [10]) is often used to describe the asymptotic 
behavior of a system by means of relevant simulation algorithms (Gibbs sampling 
[11], Metropolis, Metropolis-Hastings [12]). The use of Markov chains simplifies the 
modeling of a complex, multi-variant population by focusing on the information 
associated with the system state. 

This basic property of Markov chains makes it easy to describe the behavior of 
systems whose evolution can be modeled by a sequence of stochastic transitions from 
one state to another in a discrete set of possible states, which occur in correspondence 
with time steps or events. 



2.2 Definitions and Basic Properties 



Let $X^{(k)}$ be a set of possible states of a system, at the k-th step or value of time, for any 
entity of a considered population. If the state of an entity at a generic k-th step can be 
expressed through a vector of variables, then $X^{(k)}$ can be written as follows: 

$X^{(k)} = \{ x_1^{(k)}, x_2^{(k)}, \ldots, x_w^{(k)} \}$    (1) 



where w is the number of possible states, for the considered entity, at k-th step. 

Then, such a system could be modeled through Markov chains only if the 
probability distribution of a generic state $x_j^{(k+1)}$ depends entirely on the 
value of the state vector assumed at the k-th step, i.e., $x_i^{(k)} \in X^{(k)}$. 

Formally: 

$P(x_j^{(k+1)} \mid x_i^{(k)}, x_l^{(k-1)}, \ldots, x_m^{(0)}) = P(x_j^{(k+1)} \mid x_i^{(k)}) \quad \forall i, j = 1, \ldots, w$    (2) 

Of course, equation (2) holds for any step k. 

To define the Markov chain we need to know the initial probability of a generic 
state $x_j^{(0)}$, $p_j^{(0)}$, $\forall j$, and the transition probability for any possible state $x_j^{(k+1)}$ to follow 
the state $x_i^{(k)}$, which is denoted by $t_{ij}^{(k)}$. If the transition probability does not 
depend on the step k (e.g. for stationary systems), the Markov chain is said to be 
homogeneous and the transition probability can be written as $t_{ij}$. Using the 
transition probabilities, the probability for the state $x_j^{(k+1)}$ at time k+1, denoted by 
$p_j^{(k+1)}$, can be easily computed from the corresponding probabilities at time k as 
follows: 

$p_j^{(k+1)} = \sum_{i=1}^{w} p_i^{(k)} t_{ij}^{(k)}$    (3) 

Given the vector of initial probabilities, $p^{(0)}$, equation (3) determines the behavior 
of the chain for all the time instants. The probabilities at step k can be viewed as a 
row vector, $p^{(k)}$, and the transition probabilities at step k as a matrix, $T^{(k)}$, or simply T if 
the chain is homogeneous. Equation (3) can be expressed as: 

$p^{(k+1)} = p^{(k)} T^{(k)}$    (4) 

For a homogeneous chain, $T^{k}$, that is the k-th power of the matrix T, gives the 
transition probabilities over k steps, so that: 

$p^{(k)} = p^{(0)} T^{k}$    (5) 
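
The propagation of state probabilities through a homogeneous chain can be illustrated numerically. The sketch below uses plain Python lists; the function names `step` and `propagate` and the two-state matrix are assumptions chosen for the example, not values from the case study.

```python
def step(p, T):
    """One step of the chain: multiply the row vector p by the matrix T."""
    w = len(p)
    return [sum(p[i] * T[i][j] for i in range(w)) for j in range(w)]


def propagate(p0, T, k):
    """Apply `step` k times, starting from the initial distribution p0."""
    p = list(p0)
    for _ in range(k):
        p = step(p, T)
    return p


# Hypothetical two-state chain in which state 1 is absorbing.
T = [[0.5, 0.5],
     [0.0, 1.0]]
p0 = [1.0, 0.0]
p2 = propagate(p0, T, 2)  # distribution after two steps
```

Repeated application of `step` is equivalent to multiplying the initial row vector by the k-th power of T, so `propagate` returns the distribution after k steps without forming the matrix power explicitly.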



3 Application of Markov Chains to the Mining of Time Series 
3.1 Mining Information by Means of Markov Chains 

The class of addressed problems takes the form of time series analysis to extract 
non-evident behavioral patterns from data. In the next sections a modeling technique 
aiming at applying Markov chain theory to data mining problems, which can be 
modeled with time-series, is presented. The application of such a modeling approach 
to a case study represented by the analysis of university dropouts will follow. 

3.2 Definition of the Problem 

In general, given a population made of a finite number of entities, each entity can be 
associated with a series of successive events that characterize its behavior. 

Let e be a generic entity from the considered population and $\langle s_1 s_2 s_3 \ldots s_n \rangle$ a
sequence of successive events. Then the association between the entity e and its
relevant series of events can be written as follows:

$$e \rightarrow \langle s_1 s_2 s_3 \ldots s_n \rangle \qquad (6)$$

where n = n(e) is the number of events of the series.

Let us consider a database of customer transactions where the various entities are 
represented by the customers and the events by their economic transactions (Table 1). 



Table 1. An example of customer transactions. 



Cust_id   Transaction_time   Item_id_bought   Amount Paid (£)
1         20/1/98            10               5
1         20/1/98            12               2
2         21/1/98            2                3
3         22/1/98            22               6
...       ...                ...              ...

The various items are classified in three market classes, as stated by the following 
Table 2. 



Table 2. An example of market classes for the items. 



Item_id   Class
1-10      A
11-20     B
21-30     C



To present a probabilistic approach to analyzing the sequence of states that
characterizes each component of the considered population, as a first modeling step,
the state of an entity over time must be defined by specifying a state vector. The
elements of such a vector are the various variables characterizing the events. These
variables are provided, in general, as the result of grouping operations on the events.

Formally, the state of an entity at a time t is expressed by a state vector that is a
function of the same entity and of t, that is:

$$\forall t,\ t \in \mathbb{R}^{+},\ \exists x \in X^{(t)} : x = f(e, t) \qquad (7)$$

where $X^{(t)}$ is the state space of f.

Such a grouping step could correspond, in the above example, to the definition of a 
set of values for the state variables computed for each customer as the total expense of 
the last month (e.g. 30 days) and the corresponding percentage distribution for the 
different market classes grouped by transaction period. 
Grouping by transaction period requires time to be sampled in an appropriate
way. The sampling should identify time reference points that are significant for the
considered case, and it allows the state function to be expressed as:

$$x = f(e_i, t_j), \quad e_i \in P \ \text{(entity set)}, \quad x \in X^{(t_j)} \ \text{(state space for } f\text{)}, \quad t_j \in T \ \text{(set of sampling times)} \qquad (8)$$

In the considered transaction example, the 30th of each month has been chosen as
reference day, and the time horizon is 12 months.



Table 3. A possible definition of a state vector for the customers. 



e_i   t_j       m    a    b    c
1     30/1/98   7    80   20   0
2     30/1/98   3    42   0    58
...   ...       ...  ...  ...  ...
1     30/2/98   5    30   20   50



In this case, the state vector of the customer entity is made of four variables:

$$f(e_i, t_j) = (m, a, b, c), \quad i = 1, \ldots, 30;\ j = 1, \ldots, n \qquad (9)$$

where m represents the total monthly expense, and a, b and c represent the percentage
distribution of the total monthly expense over the three market classes defined in
Table 2. The values of a, b, c and m are intended to lie within suitable admissible
ranges. In order to represent the problem through a Markov chain model, the
admissible ranges should be specified through a finite number of possible states for
each $t_j$. In general, this means identifying a discrete space for the state and a
mapping function that associates an actual state space point with a point in the
discrete space.

In the case of the considered example such a discrete state space could be
represented by a set of four qualitative levels related to the total monthly expense
variable. Then the needed mapping could be obtained by specifying the interval of
values of the actual state variables corresponding to such qualitative values. In terms
of the state function this could be expressed as:

$$x_d = f(e_i, t_j), \quad e_i \in P \ \text{(entity set)}, \quad x_d \in X_d \ \text{(discrete state space for } f\text{)}, \quad t_j \in T \ \text{(set of sampling times)} \qquad (10)$$

In the proposed example, the total expense and the percentage for market classes
could be approximated by using, for instance, the value mapping specified in Tables 4
and 5.
Tables 4 and 5. The value ranges for the state variables.

Month_expense          Value
Expense < 5            Very low
5 < Expense < 10       Low
10 < Expense < 100     Average
Expense > 100          High

Expense_distribution   Value
0 < Distrib. < 20      Poor
20 < Distrib. < 40     Medium low
40 < Distrib. < 60     Medium high
60 < Distrib. < 80     Rich



At the end of these modeling steps, each customer and each sampling time are 
associated with a record containing the aggregated sampled and discrete values. The 
resulting table is in general composed by records in the form: [Entity_id, 
Sampling_Time, Discrete component 1... Discrete component n] and represents, for 
each entity, its time evolution in the state space. 
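The discretization step can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the tables do not say which class a boundary
value (for example an expense of exactly 5) belongs to, so the half-open intervals
used here are a guess, and distribution values above 80 are returned as `None`
because the table does not cover them. The function names are hypothetical.

```python
def expense_level(m):
    """Map a total monthly expense to a qualitative level (Table 4).

    Assumption: each boundary value belongs to the upper class.
    """
    if m < 5:
        return "Very low"
    if m < 10:
        return "Low"
    if m < 100:
        return "Average"
    return "High"


def distribution_level(d):
    """Map a percentage share to a qualitative level (Table 5).

    Assumption: boundaries belong to the upper class, except 80, which is
    kept in "Rich"; values outside 0..80 are not covered by the table.
    """
    if d < 0 or d > 80:
        return None
    if d < 20:
        return "Poor"
    if d < 40:
        return "Medium low"
    if d < 60:
        return "Medium high"
    return "Rich"


def discretize(record):
    """Turn (entity, time, m, a, b, c) into a record of discrete components."""
    entity, t, m, a, b, c = record
    return (entity, t, expense_level(m),
            distribution_level(a), distribution_level(b), distribution_level(c))


# First row of the state vector table for the customers.
row = discretize((1, "30/1/98", 7, 80, 20, 0))
```

Each output record has the form [Entity_id, Sampling_Time, discrete components],
which is the input expected by the counting step that estimates transition
probabilities.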

Let us define $n_i^{(k)}$ as the number of entities that are in the state $x_i$ at
sampling time k, and $n_{i,j}^{(k,k+1)}$ as the number of entities that move from the
state $x_i$ at sampling time k to the state $x_j$ at sampling time k+1. Considering a
pair of states that are contiguous in time, i.e. which are associated with two
successive sampling times, the transition probability (1) can be expressed as:

$$t_{i,j}^{(k,k+1)} = \frac{n_{i,j}^{(k,k+1)}}{n_i^{(k)}} \qquad (11)$$

Let $I_w = \{w_1, w_2, \ldots, w_l\}$ be the set of the states at stage w. Through
equation (11) the matrix $T^{(k)} = [t_{i,j}^{(k,k+1)}]$ is obtained, where $i \in I_k$
ranges over the set of indexes relevant to the starting states and $j \in I_{k+1}$ over
the set of indexes of the arriving states. In general $T^{(k)}$ is not a square matrix
because the number of starting states, that is the number of rows of the matrix, is
generally different from the number of arriving states, that is the number of its
columns.

In general the probability $p_{i,j}^{(k,h)}$ to reach a state $x_j$ at a stage h from a
state $x_i$ at a stage k (k and h not contiguous) can be expressed, extending
equation (4), as:

$$p_{i,j}^{(k,h)} = e_i^{T} \left( \prod_{z=k}^{h-1} T^{(z)} \right) e_j \qquad (12)$$

where $e_i^{T}$ is a row (transposed) vector having 1 in the i-th component and 0 in any
other position and $e_j$ is a column vector with 1 in the j-th component and 0 anywhere
else.
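Equations (11) and (12) can be illustrated with a short Python sketch. It assumes
that the discretized records have already been paired up, so that each entity
contributes one (state at stage k, state at stage k+1) tuple; the function names,
the list `pairs` and the second matrix `T1` are hypothetical and serve only to
show the computation.

```python
from collections import Counter


def transition_matrix(pairs, start_states, end_states):
    """Equation (11): t_ij = n_ij / n_i, estimated from observed state pairs."""
    n_ij = Counter(pairs)
    n_i = Counter(start for start, _ in pairs)
    return [[n_ij[(s, e)] / n_i[s] if n_i[s] else 0.0 for e in end_states]
            for s in start_states]


def multiply(A, B):
    """Plain matrix product; the matrices need not be square."""
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def reach_probability(matrices, i, j):
    """Equation (12): entry (i, j) of the product T^(k) ... T^(h-1)."""
    product = matrices[0]
    for T_z in matrices[1:]:
        product = multiply(product, T_z)
    return product[i][j]


# Hypothetical observations between two successive sampling times.
pairs = ([("Very low", "Very low")] * 3 + [("Very low", "Average")]
         + [("Low", "Average"), ("Low", "High")])
T0 = transition_matrix(pairs, ["Very low", "Low"],
                       ["Very low", "Average", "High"])
# Hypothetical matrix for the next stage (3 starting states, 2 arriving states).
T1 = [[1.0, 0.0],
      [0.5, 0.5],
      [0.0, 1.0]]
p = reach_probability([T0, T1], 0, 0)
```

Note that `T0` has two rows and three columns, which matches the remark that the
transition matrices are in general not square.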
Table 6. The transition probability computed for the customer transactions example

t_0       Tot_0      t_1       Tot_1       Prob.
30/1/98   Very low   30/2/98   Very low    0.75
30/1/98   Very low   30/2/98   Average     0.25
30/1/98   Low        30/2/98   Very high   0.2
30/1/98   Low        30/2/98   Average     0.4
30/1/98   Low        30/2/98   High        0.4
30/1/98   High       30/2/98   High        1
...       ...        ...       ...         ...

Considering the example of the customer transactions, where only one state
variable is analyzed, let us denote with $t_0$ and $t_1$ two successive sampling
instants, with [Tot_0] the state vector at sampling instant $t_0$ and with [Tot_1] the
same at $t_1$. Then, the probability values that characterize the Markov chains can
be computed according to (11) as represented in Table 6.

Now, to give a measure of the statistical significance of the information resulting
from this kind of analysis, the concept of support has to be introduced. The support
is frequently used in data mining to evaluate the reliability of association rules [13]
[14].

Let us now develop a suitable definition to apply the concept of support to the
Markov chain modeling technique presented here. The statistical significance of the
transitions leaving from a state $x_i$ at time k is expressed by the support of the
considered state, which can be defined as follows:



c^(x-) = 




(13) 



In the following section a more detailed example of the proposed method is 
provided. 



4 Case Study: The Analysis of University Dropouts 

The case study that will be presented concerns an approach to the more general
problem of university dropouts. The data mining method previously described is
used to show implicit correlations between the different elements of a student state
(the number of passed exams, the average mark, changes of residence and so on) and
the decision to give up studying.

In our case the goal is to extract, through data mining, typical patterns from the
data that end with dropout or degree.

The analysis of such patterns makes it possible to identify the set of students who
run the risk of dropping out and therefore to determine high-risk situations in the
students’ careers.
4.1 Description of the Problem

In the above-described context the "entities" are simply university students observed
for a period of twenty years (from 1978 to 1998). Therefore the data set includes
both students who have already left the university and students who were still
attending in 1998.



Table 7. An example of the students’ personal data 



ID_code   Matriculation_date   Degree_date   Dropout_date
1         1/11/88              20/4/94
2         1/11/89                            1/5/95
...       ...                  ...           ...

The students' personal data are inserted in a table that reports, for each student, the 
matriculation date, the date of degree or the date of the first "non-enrollment" that can 
be considered as the dropout date. 

The exams passed by the students are considered as the "events" that characterize
their curriculum; therefore a student's state is given by a vector of three aggregated
variables: the number of passed exams, the average mark and the student’s condition
(attending/graduate/dropped-out) as coded in Table 9.

Time here represents the distance from the matriculation date and it is sampled 
non-homogeneously to reflect only specific moments that are particularly significant 
during an academic year. 



Table 8. The state variable values for the students’ data 



ID_code   Sampling time   Number of passed exams   Average mark   Condition code
1         5               1                        25             A
...       ...             ...                      ...            ...
1         72              28                       26             D
2         5               2                        20             A
...       ...             ...                      ...            ...
2         24              3                        22             G
3         5               1                        21             A
...       ...             ...                      ...            ...

Table 9. The condition codes 



Description   Condition code
Attending     A
Graduate      G
Dropped-out   D
The expression (6) for the drop-out case study and a given time horizon becomes:

$$x = f(e_j, t), \quad j = 1, \ldots, n, \quad n = \text{number of observed students};$$
$$t \in T, \quad T = \langle 5, 12, \ldots, 240 \rangle \ \text{time instants vector}; \qquad (14)$$
$$x \in X, \quad X = \langle \text{passed exams, average mark, attending/graduate/drop out} \rangle$$

The discretization step described by (10) is here achieved by considering suitable 
ranges for the average mark values to avoid an excessive state scatter and to maintain 
a sufficient support level. 



Table 10. Classes for the average mark 



Average mark   Code
27-30          High
23-26          Medium
18-22          Low



Let k and k+1 be two generic successive stages; then, let N, F and S represent the 
components of the state, respectively the number of passed exams, the average mark 
range and the condition code. Each state component obviously has values depending 
on the stage. The computation of the transition probabilities, performed through (11), 
is summarised in Fig. 1 . 



(Figure: states at Stage k linked to states at Stage k+1)

Fig. 1. The computed transition probabilities

The final result of the process of computing the transition probabilities is a
sequence of matrices $T^{(k)}$, corresponding to the transition from states in the k-th
stage to states in the (k+1)-th one. For each stage there are two absorbing states, one
associated with degree (G) and the other with drop-out (D).
For students still attending in a generic state $x_i^{(k)}$ it is possible to calculate
the probability of reaching each of the absorbing states, $p_{i,D}^{(k)}$ and
$p_{i,G}^{(k)}$. The way to get the two probabilities is similar. Taking for example
$p_{i,D}^{(k)}$:

$$p_{i,D}^{(k)} = \sum_{w=k+1}^{h_f} e_i^{T} \left( \prod_{z=k}^{w-1} T^{(z)} \right) e_D \qquad (15)$$

where $e_D$ has 1 in correspondence to the absorbing state D and $h_f$ is the final
stage of the time horizon considered.

Obviously the following relation holds; 



Pi,D^Pi,G ^ 



( 16 ) 



The result of the application of Markov chains can be used, in the present context, 
to discover the sets of students with different risk degree of dropout, but it can also 
constitute the basis for further more accurate analysis of individual behavior. 

The first goal comes from the analysis of the dropout probability for each
intermediate state and then from the construction of clusters of students grouped by
dropout risk level. Figure 2 is based on data from the University of Genoa, Faculty of
Engineering, and it refers to the students who were attending university in the years
between 1978 and 1998. The observed sample consists of about 15000 students over
the twenty-year period under consideration.
Fig. 2. (A) Dropout probability and (B) number of attending students versus passed
exams for the considered sample and relative to the 4th year at University.
The computation of the transition probabilities for each state (Figure 1) could be
used to associate a probability value with the possible behavioral patterns, that is,
the possible sequences of states that characterize the behavior of a subset of
students. The evaluation of the most likely behavioral patterns starting from a state
can be used to forecast the behavior of a single student who belongs to that state.

Figure 2 represents in (A) the trend of the dropout probability corresponding to the 
number of passed exams while (B) gives a measure of the support, based on (13), of 
the probabilistic information provided above. 

A more complete representation of the experimental results is currently under
development. One such possible representation consists in the construction of
dropout risk clusters. Such clusters can be useful to identify appropriate actions to
try to influence the behavior of the students at risk of dropout and, as a consequence,
the evolution of their careers. Since $T^{(k)}$ depends on those actions, it can be
inferred that the transition probabilities $T^{(k)}$ could change significantly in the
long run. In this case the model based on Markov chains could be used for planning
and control.

Another use of the proposed Markov chain based modeling technique is to improve 
the knowledge about possible future behavior of a single student. A better description 
of the dynamic behavior or of the structural characteristics of the student, in terms of 
state dimension, is in general useful to this aim. 

In the following section further improvements of the model will be presented in 
order to minimize the effects of the above mentioned problems. 

4.2 Entropy Driven State Space Expansion 

Results deriving from the use of the Markov chain based modeling technique may
imply local phenomena of uncertainty in terms of lack of discrimination power,
particularly in the post-evaluation analysis of single cases. This means that in some
“critical points” the degree of scattering of the probability may be particularly high.
For these cases the value of the entropy function associated with a particular state
$x_i^{(k)}$,

$$E\left(x_i^{(k)}\right) = -\sum_j p_{i,j}^{(k,k+1)} \log p_{i,j}^{(k,k+1)}$$

can be used as a measure of uncertainty.
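A minimal Python sketch of this entropy measure is given below. The base of the
logarithm is not legible in the source, so base 2 is used here purely as an
assumption; the helper names and the example threshold are likewise hypothetical.

```python
import math


def state_entropy(row, base=2):
    """Entropy of one row of transition probabilities.

    Terms with zero probability are skipped, following the usual
    convention that 0 * log(0) counts as 0.
    """
    return -sum(p * math.log(p, base) for p in row if p > 0)


def uncertain_states(T, threshold):
    """Indexes of the starting states whose entropy exceeds the threshold."""
    return [i for i, row in enumerate(T) if state_entropy(row) > threshold]


# Hypothetical transition matrix: state 0 is evenly split between two
# outcomes, state 1 always moves to the same outcome.
T = [[0.5, 0.5],
     [1.0, 0.0]]
critical = uncertain_states(T, threshold=0.9)
```

A state with a deterministic outcome has entropy 0, while a state split evenly
between two outcomes has entropy 1 (in base 2), so only the first state is flagged
as a critical point in this example.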

Post-evaluation analysis generally implies some advisory activities and, when the 
value of entropy for a state exceeds an a-priori fixed threshold |T, it could be even 
more convenient for such an advisory function to apply some techniques that allow 
the refinement of knowledge and the reduction of the uncertainty. 

Such techniques may include, for example, a drill down procedure like the 
extension of the dimension of the state space (number of credits gained by the student, 
age of the student, current increment in the number of passed exams...) or the 
introduction of “memory” in terms of personal patterns associated to a given state. 
5 Conclusions

The behavior and the choices of an individual can often be referred to the behavior of
the groups of people that statistically represent them. This paper defines an approach
based on Markov chains to define clusters of people with a homogeneous behavior
and to identify individual patterns that represent the behavior of the single components
of the cluster. Such behaviors can be described through Markov chains as a series of
transitions characterized by time.

The proposed method has been applied to a case study concerning the problem of 
university dropouts. In such a context the proposed modeling technique can be used in 
order to define clusters of students associated with different dropout risk degree. 

Another use of the method concerns the analysis of the individual patterns in order 
to identify possible policies aimed at lowering the dropout risk levels. Then, in this 
sense, the proposed method can be used for planning and control activities. 
Abstract. In this paper, we present SQL/LPP+, a temporal correlation
verification language for time series databases. SQL/LPP+ is an
extension of SQL/LPP[6] and inherits its ability to define time series
patterns. SQL/LPP+ enables users to cascade multiple patterns using
one or more of Allen’s temporal relationships, and obtain the desired
aggregates or meta-aggregates of the composition. The issues of pattern
composition control are also discussed.



1 Introduction 

1.1 Motivation 

Discovering temporal correlation among events is a fundamental method for
finding causality of events. The study of causality is considered to be one of
the most important tasks for most natural and social scientists. Although there
are many different approaches to identify cause and effect among events, the
simplest and the most intuitive way is to observe historical data and check the
likelihood that a set of events occurs in a particular temporal order. Since the
occurrence of events is not predictable, continuous observation or measurement
of the subject phenomenon is a common way to track events. For example,
the Richter reading and stock indices need to be continuously recorded in
order to catch important seismic and economic events respectively. The result of
these observations and measurements is time series data¹. Hence time series data
is often the only scientific ground on which to build theories. Events, in this
representation, are time series patterns which possess some special features.

We define temporal correlation as the likelihood that a given set of events
occurs in a particular time period. To extract knowledge from data, domain
experts first propose hypotheses about the temporal correlation of events, then
develop a specialized program to verify their hypotheses. Temporal correlation
verification (TCV) is the task of finding temporal correlation among events.

For example, we would like to know how likely the DJIA is to decline after a
period of high interest rates, and how likely the total seasonal rainfall in Los
Angeles is to increase after a 3-month period of high sea water temperature.
While establishing the real cause and effect relations among these events
requires profound domain knowledge, we believe temporal correlation verification
is a necessary part of the theory construction and researchers can benefit from
utilizing temporal correlation verification languages.

¹ If the observation is done at regular time steps, then the result is a regular time
series. Otherwise, it is irregular.

1.2 Motivating Examples 

Example 1. This example is a typical case in technical analysis of stocks[4]. A
Double Bottom formation is commonly seen as a pattern that signals a market
bottom. A short-term stock trader might be interested to know the winning rate,
average profit, maximal loss and total number of trades if he/she had simply
bought a stock when he/she saw the pattern form and held it for a fixed number
of trading days.

Example 2. Everyone in the stock market may be interested to know, based
on financial history, how likely a downtrend of the DJIA is to last for more than 6
months during periods in which the 30-year T-bond yield was below 5.0%.

By examining these examples, it is easy to see that these problems have common
features. In each case, the events of interest are not individual records.
Instead, they are represented by a group of records in continuous time intervals.
The temporal correlation among the events can be verified by counting
the occurrences of the corresponding time series patterns. With currently available
technology, solving each problem requires tedious procedural program coding.
Considering that these hypothesis tests are usually run only once or need frequent
modification, the cost of this kind of knowledge extraction is fairly high.
If there is a formal language that can express all these patterns, then domain
experts no longer need to spend their valuable time in coding/debugging and
can concentrate more on perfecting their theory. Computer scientists can also
focus on improving the language and the execution efficiency instead of helping
domain experts in a case-by-case manner. The need for a declarative language
which permits fast formulation and execution of queries is obvious.

1.3 The Problems 

In the aspect of data granularity, knowledge discovery in databases (KDD) and
data mining[2] in time series differ from those in set-oriented data. As stated by
Shatkay[7],

Individual values are usually not important but the relationships between
them are.

It is continuous time series segments, instead of individual records, that represent
events. Any attempt to design a time series data mining language must provide
a way to define patterns that represent events.

From the viewpoint of KDD and data mining, Shatkay’s statement can be
further improved to:

Individual time series segments are usually not important, but the
correlations between them are.

While occurrences of patterns represent events, it is very hard for a human mind
to learn anything from a large number of events. Another level of abstraction
is required. For each of the motivating examples, the answer should consist of
only one or a few numerical values instead of a listing of all related events. The
purpose of TCV query languages is to provide that extra level of abstraction.



1.4 The Basic Idea 

Recognizing the importance of TCV problems, in this paper we attempt to
generalize the problem and to design a formal language that is expressive enough
to address TCV problems in an unambiguous manner. The language we propose,
SQL/LPP+, serves as a problem definition language for TCV problems as
VHDL serves for hardware description and XML for documentation. Although
the LPP model[5] provides a fairly efficient execution model, we do not exclude
the possibility that future research can perform with greater efficiency.

We believe that a temporal correlation verification language should provide
users a way to specify:

1. The (one or more) patterns of interest. 

2. The temporal coupling relationship(s) among them. 

3. The aggregates (statistics) of interest, which might be simple aggregates or 
meta-aggregates (Section 3.2). 

4. The control of pattern occurrence counting, that is, whether an occurrence of 
a (syntactically) preceding pattern should couple with only one or multiple 
occurrences of the following pattern. 

Our idea is to cascade pattern queries. Each pattern, with associated
aggregates, is specified in a syntactic order and is connected by a combination
of temporal relationships. Then the system follows the syntactic order to find
occurrences of each pattern and update the aggregates associated with them. The
final value of the aggregates is the output of the verification.



2 Defining SQL/LPP+ Patterns 

In this section, we briefly review the pattern defining capability that SQL/LPP+
has inherited from SQL/LPP. For more details, please see [5,6].

SQL/LPP+ patterns, like procedures, functions and triggers, are first-class
objects in time series databases. A simple SQL/LPP+ pattern describes the
properties of a single segment. In contrast, a composite SQL/LPP+ pattern is
formed from multiple already defined patterns.
Simple SQL/LPP Pattern Declaration

CREATE ROW TYPE quote( date datetime, price real, volume int )
CREATE TABLE daily_stocks( symbol lvarchar, quotes TimeSeries(quote) )

The main body of a simple SQL/LPP pattern is a segment of a certain element
type. A number of public attributes can be defined by the ATTRIBUTE ... IS
clause. Pattern sentences and search directives are placed in WHERE and WHICH_IS
clauses respectively.

The following example demonstrates basic pattern declaration. 

Example 3. Consider an uptrend pattern in a daily stock price database to be 
a continuous period which satisfies the following two conditions: 

1. The closing price of each day, except the first day, is higher than the one of 
the previous day. 

2. The length of the period is at least 5 days. 

The pattern uptrend can be expressed as: 

CREATE PATTERN uptrend AS
  SEGMENT s OF quote WHICH_IS FIRST MAXIMAL, NON-OVERLAPPING
  ATTRIBUTE date IS last(s, 1).date
  ATTRIBUTE low IS first(s, 1).price
  ATTRIBUTE high IS last(s, 1).price
  WHERE [ALL e IN s] (e.price > prev(e, 1).price)
    AND length(s) >= 5

This pattern has three publicly accessible attributes, date, low and high. 
The attributes define the only part that other statements can access. The search 
directive FIRST MAXIMAL tells the search engine to report only the longest seg- 
ment once a group of adjacent answers are found. Another search directive 
NON-OVERLAPPING states that reported answers must not overlap with each other. 
The rest of the statement should be clear without explanation. 
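To make the semantics of the uptrend pattern concrete, the following Python
sketch scans a list of closing prices for maximal strictly increasing runs of at
least 5 days. It is only an approximation written for illustration: it is not the
SQL/LPP execution model, the function name `find_uptrends` and the sample prices
are hypothetical, and maximal runs are used as a stand-in for the FIRST MAXIMAL
and NON-OVERLAPPING directives.

```python
def find_uptrends(prices, min_length=5):
    """Return (start, end) index pairs, inclusive, of maximal runs in which
    every price except the first is higher than the previous one."""
    segments = []
    start = 0
    for i in range(1, len(prices) + 1):
        # A run ends at the end of the data or when the price fails to rise.
        if i == len(prices) or prices[i] <= prices[i - 1]:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = i
    return segments


# Hypothetical closing prices.
prices = [5, 1, 2, 3, 4, 5, 6, 3, 4, 5]
uptrends = find_uptrends(prices)

# Attributes corresponding to low and high in the pattern declaration.
attributes = [{"low": prices[s], "high": prices[e]} for s, e in uptrends]
```

Since the runs are maximal and a new run starts only where the previous one ends,
the reported segments never overlap.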



Composite Pattern Definition A composite pattern is declared as a con- 
catenation of multiple non-overlapping patterns. The search directives of sub- 
patterns can be overridden by specifying new search directives. The following 
example demonstrates the use of composite patterns. 

Example 4. Assume pattern downtrend is defined symmetrically to pattern
uptrend in Example 3. The pattern double_bottom consists of 4 trends as shown
in Figure 1. The pattern has the following properties:

1. The starting point is 20% higher than the local maximum. 

2. The difference of the two bottoms is less than 5% of the first bottom. 

3. The ending point is higher than the local maximum. 
Fig. 1. Double-bottom pattern

CREATE PATTERN double_bottom AS
  {downtrend p1; uptrend p2; downtrend p3;
   uptrend p4 WHICH_IS ALL, NON-OVERLAPPING}
  WHICH_IS NON-OVERLAPPING
  ATTRIBUTE date IS last(p4).date
  ATTRIBUTE price IS last(p4).high
  WHERE (p1.high > p2.high * 1.2)
    AND (abs(p1.low - p3.low) < 0.05 * p1.low)
    AND (p4.high > p2.high)



3 Temporal Correlation Verification in SQL/LPP+

3.1 Relative Relationships of Segments

As indicated in [1], given 2 time series segments, there are 13 basic temporal
relationships if we include all symmetric cases. SQL/LPP+ adopts them as the
basic relationships in TCV problems. To transform these relationships to a form
to which cascading querying can be applied, we define shadow functions.

Definition 1. Given a time series segment $ts[x, y]$ and a temporal relationship
$R$, the shadow function $\Gamma$ is defined as
$\Gamma(ts[x, y], R) = \{ts[x', y'] \mid R(ts[x, y], ts[x', y'])\}$.

The shadow functions of the basic temporal relationships are shown in Table 1.

These relationships are used directly in SQL/LPP+ to specify the temporal
coupling of two patterns. SQL/LPP+ allows users to combine two or more
relationships by the logical connectives AND and OR. The interpretation of composite
relationships is: given a segment $s$ and relationships $R_1$ and $R_2$,
$\Gamma(s, (R_1 \text{ AND } R_2)) = \Gamma(s, R_1) \cap \Gamma(s, R_2)$ and
$\Gamma(s, (R_1 \text{ OR } R_2)) = \Gamma(s, R_1) \cup \Gamma(s, R_2)$.

Users must be aware that some combinations can result in an empty shadow and
should be avoided.

Allen’s 13 interval relationships and their combinations can represent any
relative temporal relationship. However, an interesting question is how to
represent the relationship: pattern A is at most 10 days and at least 3 days before
pattern B. We introduce glue² patterns to solve this problem. For example, we
can define a glue pattern glue_3_10 as:

CREATE PATTERN glue_3_10 ON quote AS
  SEGMENT s
  WHERE count(s) >= 3 AND count(s) <= 10;

Then we define a composite pattern C as the concatenation of A and glue_3_10.
The relationship mentioned above can be simply represented as C meets B. By
using glue patterns, users can also specify the length range of the overlapping
part of two segments.



#      Relationship (R)   Shadow Function Γ(ts[x, y], R)
(1)    before             {ts[x', y'] | y < x'}
(2)    meets              {ts[x', y'] | y = x'}
(3)    left_overlaps      {ts[x', y'] | x < x', x' < y < y'}
(4)    left_covers        {ts[x', y'] | x < x', y = y'}
(5)    covers             {ts[x', y'] | x < x', y > y'}
(6)    right_covered      {ts[x', y'] | x = x', y < y'}
(7)    equal              {ts[x', y'] | x = x', y = y'}
(8)    right_covers       {ts[x', y'] | x = x', y > y'}
(9)    covered            {ts[x', y'] | x > x', y < y'}
(10)   left_covered       {ts[x', y'] | x > x', y = y'}
(11)   right_overlaps     {ts[x', y'] | x' < x < y', y > y'}
(12)   met                {ts[x', y'] | x = y'}
(13)   after              {ts[x', y'] | x > y'}



Table 1. The shadow functions of Allen’s 13 Interval Relationships. By default, every 
relationship has x < y and x' < y' 
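The shadow functions and their AND/OR combinations can be sketched in Python as
predicates over segment end points. This is an illustrative sketch only: segments
are represented as (x, y) tuples of integers, the candidate set is enumerated
explicitly, and the names `RELATIONS`, `shadow`, `shadow_and` and `shadow_or` are
assumptions introduced for this example rather than part of SQL/LPP+.

```python
# Each predicate takes (x, y) of the given segment and (x2, y2) of a candidate.
RELATIONS = {
    "before":         lambda x, y, x2, y2: y < x2,
    "meets":          lambda x, y, x2, y2: y == x2,
    "left_overlaps":  lambda x, y, x2, y2: x < x2 < y < y2,
    "left_covers":    lambda x, y, x2, y2: x < x2 and y == y2,
    "covers":         lambda x, y, x2, y2: x < x2 and y > y2,
    "right_covered":  lambda x, y, x2, y2: x == x2 and y < y2,
    "equal":          lambda x, y, x2, y2: x == x2 and y == y2,
    "right_covers":   lambda x, y, x2, y2: x == x2 and y > y2,
    "covered":        lambda x, y, x2, y2: x > x2 and y < y2,
    "left_covered":   lambda x, y, x2, y2: x > x2 and y == y2,
    "right_overlaps": lambda x, y, x2, y2: x2 < x < y2 < y,
    "met":            lambda x, y, x2, y2: x == y2,
    "after":          lambda x, y, x2, y2: x > y2,
}


def shadow(segment, relation, candidates):
    """Candidates (x2, y2), with x2 < y2, that stand in `relation` to segment."""
    x, y = segment
    holds = RELATIONS[relation]
    return {(x2, y2) for (x2, y2) in candidates
            if x2 < y2 and holds(x, y, x2, y2)}


def shadow_and(segment, relations, candidates):
    """Shadow of an AND combination: intersection of the individual shadows."""
    result = set(candidates)
    for relation in relations:
        result &= shadow(segment, relation, candidates)
    return result


def shadow_or(segment, relations, candidates):
    """Shadow of an OR combination: union of the individual shadows."""
    result = set()
    for relation in relations:
        result |= shadow(segment, relation, candidates)
    return result


# All proper segments with integer end points between 0 and 6.
candidates = [(a, b) for a in range(7) for b in range(a + 1, 7)]
meets_shadow = shadow((2, 4), "meets", candidates)
empty_shadow = shadow_and((2, 4), ["before", "after"], candidates)
```

The combination of before AND after illustrates the warning above: since the 13
basic relationships are mutually exclusive, the intersection of two different
basic shadows is always empty.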



3.2 Aggregation and Meta-Aggregation

A key design goal of SQL/LPP+ is to support summarization of occurrences of
interesting pattern couplings. Aggregation on pattern occurrences serves the
purpose of summarization. We introduce the syntax and semantics of SQL/LPP+
aggregates in this subsection.

The traditional definition of aggregation is just to find the final result of
the aggregates. However, by observing the computation process, we can see that
for any aggregate function $f$ and time series $t$ of length $n$, $f_1(t), \ldots, f_n(t)$ is
again a sequence and is itself valuable information. So we can apply aggregate
functions on this sequence again and construct an aggregate of aggregates. We
call it a meta-aggregate.



² The term glue is borrowed from TeX[3].
20-day moving average). The user wants to test a trading strategy called moving
average crossover: whenever he/she sees ma5 cross over ma20, he/she buys
100 shares the next day, holds them until he/she sees ma20 cross over ma5, then
sells all the holdings the next day. He/She would like to test the strategy and see
how well it would have worked on IBM stock. First, we have to create the
pattern which represents the period of interest and find the entry and exit price
of every trade.

CREATE PATTERN crossover ON quote_and_ma AS
  SEGMENT crsovr WHICH_IS MINIMAL, NON-OVERLAPPING
  ATTRIBUTE entry IS first(crsovr, 2).price
  ATTRIBUTE exit IS last(crsovr, 1).price
  WHERE first(crsovr, 1).ma5 > first(crsovr, 1).ma20
    AND last(crsovr, 2).ma5 < last(crsovr, 2).ma20

The following SQL/LPP+ code creates a test that summarizes the performance
of this trading strategy. The aggregates we are interested in are the
number of trades, the average profit of each trade, the maximal loss in a single
trade and the maximal drawdown (the accumulated loss).

CREATE TEST crsovr_profit ON quote_and_ma AS
{ PATTERN crossover crsovr
  ATTRIBUTE trades IS count()
  ATTRIBUTE avg_profit IS avg(crsovr.exit - crsovr.entry)
  ATTRIBUTE max_single_loss IS max(crsovr.entry - crsovr.exit)
  ATTRIBUTE max_drawdown IS max(sum(crsovr.entry - crsovr.exit))
}
REPORT *;

Note that max_drawdown is defined by the meta-aggregate max(sum(·)). This
test is a single-segment verification, so no temporal relationship is involved. The
REPORT clause specifies which attributes should be reported; * is a shorthand for
all attributes.

The last part of the code specifies the tuple-level SELECT operation which 
extends SQL by adding a clause 

BY TESTING test_name test_alias IN time_series_f ield 

The main purpose of the following code is to specify what stocks to test and 
what attributes to report. 

SELECT qm. symbol, cp. trades, cp.avg_prof it, 
cp.max_single_loss , cp . m 2 UC_drawdown 
BY TESTING crsovr_prof it cp IN qm. quotes 
FROM quote_and_ma qm 
WHERE qm. symbol = "IBM" 
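Take max(sum(t)) as an example. We have:

\[
\mathrm{max}_1(\mathrm{sum}_1(t)) = \mathrm{sum}_1(t) = t_1
\]

\[
\mathrm{max}_{i+1}(\mathrm{sum}_{i+1}(t)) =
\begin{cases}
\mathrm{max}_i(\mathrm{sum}_i(t)) & \text{if } \mathrm{max}_i(\mathrm{sum}_i(t)) > \mathrm{sum}_{i+1}(t),\ i + 1 \le n \\
\mathrm{sum}_{i+1}(t) & \text{otherwise}
\end{cases}
\]

\[
\max(\mathrm{sum}(t)) = \mathrm{max}_n(\mathrm{sum}_n(t)) = \max_{i=1}^{n} \Bigl( \sum_{j=1}^{i} t_j \Bigr)
\]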



Figure 2 shows an example of max(sum(t)) calculation.

Fig. 2. An example of the meta-aggregate max(sum(t)) calculation. Given the sequence
t shown above, the result is max(sum(t)) = 12.
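
The recurrence for max(sum(t)) can be sketched in a few lines of Python. This is only an illustration and not part of SQL/LPP+: the function names are hypothetical, and the sample sequence is invented (it is not the sequence of Fig. 2).

```python
from itertools import accumulate


def prefix_aggregates(t, f):
    """Return the sequence f_1(t), ..., f_n(t), where f_i applies f to the first i items."""
    return [f(t[:i]) for i in range(1, len(t) + 1)]


def max_of_sum(t):
    """Evaluate the meta-aggregate max(sum(t)) in one pass, following the recurrence."""
    running_sum = 0   # sum_i(t)
    best = None       # max_i(sum_i(t)); None for an empty series
    for value in t:
        running_sum += value
        if best is None or running_sum > best:
            best = running_sum
    return best


# Hypothetical sample sequence (not the one shown in Fig. 2).
sample = [3, -1, 4, -6, 2]
prefix_sums = prefix_aggregates(sample, sum)
assert prefix_sums == list(accumulate(sample))      # the inner aggregate is itself a sequence
assert max(prefix_sums) == max_of_sum(sample)       # the outer aggregate is applied to it
```

The one-pass version keeps only the running sum and the best value seen so far, so the intermediate sequence never has to be stored.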



Meta-aggregates cannot be constructed arbitrarily. For example, max(t) + 1
is not a valid aggregate expression. To define meta-aggregates, we start from
simple expressions. Simple expressions are formed from constants and pattern
attributes, and are closed under arithmetic operations. For example, assume p is
a pattern alias and c1 and c2 are attributes of p; then p.c1, p.c1 * p.c2 + 3 and
(p.c1 - p.c2)/p.c1 are simple expressions.

In this paper, we only discuss five aggregate functions: count, sum, avg, min
and max. Other aggregates can be defined in a similar way. Assume exp
is a simple expression and f is an aggregate function; aggregate expressions are
then defined as follows:

1. Constants and f(exp) are aggregate expressions.

2. If E is an aggregate expression, then f(E) is also an aggregate expression.

3. If E1 and E2 are aggregate expressions, then E1 θ E2 is also an aggregate
expression, where θ is an arithmetic operation.

For example, min(avg(p.c1)), max(p.c1) - min(p.c1) and avg(max(p.c1) - min(p.c2) + 4) - 2
are aggregate expressions, but p.c1 and avg(p.c1) + p.c2 are not.
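
The three rules can be read as a recursive check over an expression tree. The following Python sketch is an assumption-laden illustration: the classes Const, Attr, BinOp and Agg and both checker functions are hypothetical names, not SQL/LPP+ constructs, and count() without an argument is modelled as an Agg whose argument is None.

```python
from dataclasses import dataclass
from typing import Optional, Union

AGGREGATES = {"count", "sum", "avg", "min", "max"}


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Attr:
    alias: str   # pattern alias, e.g. "p"
    name: str    # attribute name, e.g. "c1"


@dataclass(frozen=True)
class BinOp:
    op: str      # arithmetic operator: + - * /
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Agg:
    func: str                     # one of AGGREGATES
    arg: Optional["Expr"] = None  # None models count()


Expr = Union[Const, Attr, BinOp, Agg]


def is_simple(e):
    """Constants and pattern attributes, closed under arithmetic operations."""
    if isinstance(e, (Const, Attr)):
        return True
    if isinstance(e, BinOp):
        return is_simple(e.left) and is_simple(e.right)
    return False


def is_aggregate(e):
    """Check rules 1-3 of the aggregate expression definition."""
    if isinstance(e, Const):                      # rule 1: constants
        return True
    if isinstance(e, Agg):
        if e.func not in AGGREGATES:
            return False
        if e.arg is None:
            return e.func == "count"
        # rule 1: f(exp) with exp simple; rule 2: f(E) with E an aggregate expression
        return is_simple(e.arg) or is_aggregate(e.arg)
    if isinstance(e, BinOp):                      # rule 3: E1 op E2
        return is_aggregate(e.left) and is_aggregate(e.right)
    return False


c1, c2 = Attr("p", "c1"), Attr("p", "c2")
assert is_aggregate(Agg("min", Agg("avg", c1)))            # min(avg(p.c1))
assert not is_aggregate(BinOp("+", Agg("avg", c1), c2))    # avg(p.c1) + p.c2
```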



3.3 Single-segment SQL/LPP+ Test Cases

In the rest of this section, we present the SQL/LPP+ language through a few examples.
First, we demonstrate the aggregation and meta-aggregation of SQL/LPP+.

Example 5. Assume a user has constructed a time series view which contains
price (the daily closing stock price), ma5 (the 5-day moving average) and ma20 (the
20-day moving average). The user wants to test a trading strategy called moving
average crossover: whenever he/she sees ma5 cross over ma20, he/she buys
100 shares the next day and holds them until he/she sees ma20 cross over ma5, then sells
all the holdings the next day. He/she would like to test the strategy and see how
well it would have worked on IBM stock. First, we have to create the
pattern which represents the period of interest and find the entry and exit price
of every trade.

CREATE PATTERN crossover ON quote_and_ma AS
  SEGMENT crsovr WHICH_IS MINIMAL, NON_OVERLAPPING
  ATTRIBUTE entry IS first(crsovr, 2).price
  ATTRIBUTE exit IS last(crsovr, 1).price
  WHERE first(crsovr, 1).ma5 > first(crsovr, 1).ma20
    AND last(crsovr, 2).ma5 < last(crsovr, 2).ma20

The following SQL/LPP+ code creates a test that summarizes the performance
of this trading strategy. The aggregates we are interested in are the
number of trades, the average profit of each trade, the maximal loss in a single
trade and the maximal drawdown (the accumulated loss).

CREATE TEST crsovr_profit ON quote_and_ma AS
{ PATTERN crossover crsovr
  ATTRIBUTE trades IS count()
  ATTRIBUTE avg_profit IS avg(crsovr.exit - crsovr.entry)
  ATTRIBUTE max_single_loss IS max(crsovr.entry - crsovr.exit)
  ATTRIBUTE max_drawdown IS max(sum(crsovr.entry - crsovr.exit))
}
REPORT *;

Note that max_drawdown is defined by the meta-aggregate max(sum(.)). This
test is a single-segment verification, so no temporal relationship is involved. The
REPORT clause specifies which attributes should be reported; * is a shorthand for
all attributes.

The last part of the code specifies the tuple-level SELECT operation, which
extends SQL by adding a clause

BY TESTING test_name test_alias IN time_series_field

The main purpose of the following code is to specify which stocks to test and
which attributes to report.

SELECT qm.symbol, cp.trades, cp.avg_profit,
       cp.max_single_loss, cp.max_drawdown
BY TESTING crsovr_profit cp IN qm.quotes
FROM quote_and_ma qm
WHERE qm.symbol = "IBM"
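
To make the four reported attributes concrete, the sketch below computes them in Python from a list of (entry, exit) price pairs, one pair per crossover occurrence. It assumes the occurrences have already been found; the function name and the sample prices are hypothetical and are not IBM data.

```python
def evaluate_crossover_test(trades):
    """Compute the attributes of the test from (entry, exit) price pairs."""
    if not trades:
        return None
    losses = [entry - exit_price for entry, exit_price in trades]
    accumulated = 0.0
    max_drawdown = float("-inf")
    for loss in losses:
        accumulated += loss                              # sum(entry - exit) so far
        max_drawdown = max(max_drawdown, accumulated)    # max(sum(entry - exit))
    return {
        "trades": len(trades),                           # count()
        "avg_profit": -sum(losses) / len(losses),        # avg(exit - entry)
        "max_single_loss": max(losses),                  # max(entry - exit)
        "max_drawdown": max_drawdown,
    }


# Hypothetical trades for illustration only.
report = evaluate_crossover_test([(100, 110), (110, 104), (104, 101), (101, 120)])
```

A negative max_drawdown, as in this sample, means the accumulated result never dropped below the starting level.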
When the SELECT statement is issued to a SQL/LPP+ system, the system
will first find the record that contains IBM stock price data, then search for the
occurrences of crossover and calculate the values of the attributes for output.

Suppose there is a daily stock price database which contains quote data
spanning 20 years. There are roughly 5000 records and 12,502,500 segments for
each stock. This example demonstrates how only four quantities, the only
things that matter to a trader, are extracted from all this information.
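
The segment count can be checked with a short calculation, assuming that a segment is any contiguous run of one or more records (an assumption that reproduces the figure quoted above). The function name is hypothetical.

```python
def segment_count(n):
    """Number of contiguous segments t[i..j] with i <= j in a series of n records."""
    return n * (n + 1) // 2


assert segment_count(5000) == 12_502_500
```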



3.4 Multi-segment SQL/LPP+ Test Cases

In this subsection, we discuss how to define multi-segment tests in SQL/LPP+
and the use of temporal relationships.

Example 6. The following code defines a test to verify whether the famous Double
Bottom pattern is really a profitable signal in stock trading.

CREATE TEST db_profit ON quote AS
{ PATTERN double_bottoms db
  ATTRIBUTE db_count IS count()
  ATTRIBUTE max_entry_price IS max(db.price)
}
MEETS
{ SINGLE PATTERN uptrend ut
  ATTRIBUTE ut_count IS count()
  ATTRIBUTE avg_profit IS avg(ut.high - ut.low)
  ATTRIBUTE winning_rate IS ut_count/db_count
}
REPORT winning_rate, max_entry_price, avg_profit;

This segment of SQL/LPP+ defines a test case named db_profit on a time
series of quote type. In the BY TESTING clause, each pattern is described in
a block delimited by { and }. The first pattern to be searched for is the
double_bottoms pattern. It is given the alias "db". The ATTRIBUTE lines describe
the aggregates to calculate. count() keeps the total number of occurrences of this
pattern found by the system. max(db.price) calculates the maximum of the value
in the price field over every found occurrence. Between the pattern blocks, MEETS
is the temporal relationship of the patterns to be verified. In the next pattern
block, the keyword SINGLE denotes that for each occurrence of the preceding pattern,
double_bottoms, the system is to find at most one uptrend. (If MULTIPLE
were placed here, then for each occurrence of double_bottoms, the system would
find as many uptrends as possible, subject to the restart search directives
specified in uptrend.) The first pattern does not need the SINGLE/MULTIPLE
directive and is always controlled by its own search directives. Each time an
occurrence of a pattern is found, only the attributes in the block of that pattern
are updated. The REPORT clause lists the attributes to report.

The tuple-level SELECT sentence is omitted because it is very similar to the 
one in the previous example. 
The execution control is intuitive. Pattern searching follows the syntactic
order in which patterns are specified. When the search engine finds an occurrence
of the i-th pattern, it computes the shadow of the segment according to the
temporal relationship specified between the i-th and the (i+1)-th patterns and
attempts to find an occurrence of the (i+1)-th pattern. If it cannot find one,
it tries to find the next occurrence of the i-th pattern only if MULTIPLE is specified
for the i-th pattern.

4 Conclusion 

In this paper, we have presented SQL/LPP+. The language provides an intuitive
temporal coupling notation to specify combinations of Allen’s 13 temporal
relationships. We have also extended the concept of aggregation and introduced
meta-aggregation. With meta-aggregation, SQL/LPP+ users can obtain not only
the aggregate value of the time series but also the aggregates of aggregates. We
believe meta-aggregation is essential in time series data mining.

Currently, temporal correlation verification is a rarely touched area in KDD
and data mining. Previous work is isolated in each application domain. In this
paper, we have shown that many problems, in domains from natural science
and social science to stock trading, all have the same problem structure. The
contribution of this paper is to provide a level of abstraction on these problems
by proposing a language that can formulate the problems of interest. With a
TCV language like SQL/LPP+, it becomes possible to simultaneously benefit
researchers in numerous fields by improving the language and the efficiency of
test case evaluation.
Abstract. Following the paradigm of on-line analytical processing (OLAP),
every representation of business objects in management support systems
is multidimensional. Dynamic changes of business structures, like consolidations,
have to be modeled in the data warehouse framework. For
reasons of consistency in analytical applications it is necessary to add
temporal components to the data model. Objects and relations between
objects will be provided with time stamps corresponding to known methods
of temporal data storage. This enhancement of the OLAP approach
allows an appropriate comparative analysis between arbitrary periods even
after changes of structural data (dimensions). But any access to
multidimensional cubes makes it necessary to evaluate a meta cube.

Keywords. Multidimensional databases, data warehouse, management support
systems, temporal data, OLAP



1 Introduction 

The development of management support systems is characterized by the cyclic
rise and fall of buzzwords. Model-based decision support and executive information
systems were always restricted by the lack of consistent data. Nowadays
data warehouses try to close this gap by providing current and decision-relevant
information to allow the control of critical success factors. This is not only a
snapshot of operational performance but also a view on the time series of relevant
variables and parameters. Therefore the use of time stamps is crucial for
data warehouse applications. This paper provides concepts of temporal databases
for management support systems that lead to approaches for data warehouse
architectures.



2 Temporal Data 

In "temporal databases" we store versions of objects, which track the
evolution of these objects over a period of time. The time attributes of these
versions are called time stamps. They mark a particular point or interval on the
time axis. In this context we only consider time as a series of discrete time units.
The granularity of time measurement depends on the application area. We call
the smallest relevant time unit, which is atomic, a chronon [JDBC98, 376], so that
the time axis can be interpreted as a series of chronons. As most objects do not
alter their attributes in every chronon, we prefer to stamp these objects with
time intervals.

Temporal information systems distinguish between valid time and transaction 
time. A valid time of a fact (instance of an object) gives the interval of time in 
which the observed object keeps a constant state. The transaction time (point 
in time) of a database is the time when a fact is current and retrievable in the 
database [JDBC98, 371]. Databases with either valid times or transaction times 
are called temporal databases [JDBC98, 375]. 

There are two temporal enhancements of RDBMS: time stamps of attributes
or time stamps of tuples. With attribute stamping, every time-dependent
attribute gets a time stamp for the beginning and the end of its period of
validity. This violates 1NF because we insert many versions of one attribute
into a tuple, whereas tuple stamping avoids this effect by providing two special
attributes, "beginning of period" and "end of period", in every tuple. Nevertheless,
with tuple stamping we have to copy every tuple even when only one attribute
changes, which causes high redundancy in the database unless we use temporal
normalization to avoid these redundancies. In general we have to consider valid
times and transaction times in temporal databases. The use of one or the
other depends on the application.
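
Tuple time stamping and the redundancy it causes can be illustrated with a small Python sketch. Everything here is hypothetical: the row layout, the FOREVER sentinel for an open-ended period, the function names and the sample customer record are assumptions for illustration, with periods stored as half-open chronon intervals.

```python
FOREVER = 10**9  # hypothetical sentinel chronon meaning "still valid"


def as_of(table, key, chronon):
    """Return the version of `key` that is valid at `chronon`, or None."""
    for row in table:
        if row["key"] == key and row["begin"] <= chronon < row["end"]:
            return row
    return None


def update(table, key, chronon, **changes):
    """Close the current version at `chronon` and add a full copy with `changes` applied."""
    result = []
    for row in table:
        if row["key"] == key and row["begin"] <= chronon < row["end"]:
            if row["begin"] < chronon:
                result.append({**row, "end": chronon})           # old version, now closed
            result.append({**row, **changes, "begin": chronon})  # new version: whole tuple copied
        else:
            result.append(row)
    return result


customers = [{"key": "C1", "name": "Smith", "city": "Essen", "begin": 1, "end": FOREVER}]
customers = update(customers, "C1", 5, city="Bonn")
```

After the update the unchanged attribute name is stored twice, once per version, which is the redundancy that temporal normalization is meant to remove.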



3 Models of Temporal Data Warehouses 

Almost all implementations of temporal databases are OLTP-oriented. Little is
known about the handling of evolutionary structures [RoCR94, 138] in analytical
information systems like executive information systems or OLAP applications.
Here we do not manipulate atomic data items in a transaction system but analyse
complex dynamic data structures such as those found in corporate consolidation trees.



3.1 Fundamentals of Data Warehousing 

A data warehouse is a concept for corporate data storage where a common
business semantic meets consistent information that is of potential relevance for
decisions. Its application field is the management task of planning, analysing
and controlling the company's key processes. Inmon [Inmo96, 33] defines a data
warehouse as "... a subject oriented, integrated, non-volatile, and time variant
collection of data in support of management's decisions."

The special challenge in building a data warehouse is to deliver decision-relevant
information in time from internal and external sources, to store multidimensional
business objects in an efficient way and to present the information
for intuitive use with high performance [ChG198]. Periodic updates pump
cleaned, compressed and enriched data from OLTP systems into a data warehouse,
where the multidimensional objects are stored redundantly with respect to a
conceptual data model. All data items in a data warehouse are time-dependent.
This results from the permanent shift of states in the operational database and
the periodic transfer of historical data into the archives. Another aspect lies
in the scheduling of the data import into the warehouse. But this only determines
the upper limit for the size of a chronon: a chronon has to be smaller than the
update cycle.



Fig. 1. OLAP cube (axis label: Produced goods)



Multidimensional database design needs approaches for modeling dimensions
(sets of logically connected elements) with aggregation and disaggregation
operators to build structures within the dimensions. Each business object is
described by these structures (dimensions) and by quantities (facts). The basic
attribute of a business object is the dimension of time (periods), as the time
dependency is obvious in all business contexts.

In Fig. 1 we see a multidimensional cube that is spanned by a set of dimensions
(factory, time, product). The quantities are imported with reference to
the mapping and the synchronization from the OLTP databases. One can notice
that each component of a data warehouse (structure and facts) is temporal.
Therefore it is necessary to establish the time relation on every level of data
modeling.



3.2 Multidimensional Modeling 

While modeling a multidimensional data cube, the inherent dimension of time has
to be fixed in its granularity. This granularity of time is in most cases dependent
on the chronons of the underlying operational information systems. OLAP modeling
and temporal databases have the definition of granularity of time in common.
We will explain the temporal extension of multidimensional databases using the
above-mentioned example shown as an OLAP cube in Fig. 1.

The chosen granularity of time in this example is MONTH. That means that 
the valid time for the version of an object is explicitly expressed by the dimension 
TIME. The transaction time is irrelevant since the attributes of any object stored 
in the data warehouse must not be altered by definition. This means that we 
only store historical facts in the OLAP cube which are periodically inserted and 
stamped by the dimension TIME. 

It is much more relevant to consider the change of structures over
time. Structural data are the elements and hierarchies of dimensions which span
the cube. Each hierarchy basically consists of relations between father nodes
and son nodes in a tree. There can be multiple hierarchies on every dimension.
Consider the dimension PRODUCTS with a tree built by the distribution regions
DR0, DR1, DR2 where the products are sold.

Now, assume that we distribute our products P1, ..., P4 in periods M1 and M2
in the regions DR1 and DR2, but that we have two changes. In period M2 we
produce and distribute a new product called P5 in region DR2. A month later,
in M3, we restructure our distribution system and create a new region DR3 in
which products P4 and P5 are to be distributed. A new aggregation tree will be
valid for period M3 (Fig. 2). In management support systems there is an urgent
need for exception reporting and comparison of historical and current data. The
evolution of structures in business organizations must therefore be kept in data
warehouses. A straightforward idea for storing the valid times of structures is to
give every component in a hierarchy (tree) a time stamp attribute. This can
be done by stamping all historical relations in a consolidation tree (Fig. 2). Then
we can read the tree for every period of its life span. The time intervals give
the validity of every consolidation relation as a left-closed and right-open time
interval (n is short for Mn). So we see the emergence of product P5 in month M2,
distributed in DR2, and the shift of P4 and P5 to region DR3 in month M3.

Fig. 2. Multiple time stamped consolidation trees
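
Reading the tree for a given period from time-stamped relations can be sketched as follows. This Python fragment is an illustration under stated assumptions: DR0 is taken to be the root, the assignment of P1 to P3 to regions and of P4 to DR2 before M3 is invented (the text does not give it), and the names EDGES and tree_at are hypothetical.

```python
INF = float("inf")  # open-ended validity

# (father, son, begin, end): the relation is valid for months begin <= m < end.
EDGES = [
    ("DR0", "DR1", 1, INF),
    ("DR0", "DR2", 1, INF),
    ("DR0", "DR3", 3, INF),   # new region created in M3
    ("DR1", "P1", 1, INF),    # assumed assignment
    ("DR1", "P2", 1, INF),    # assumed assignment
    ("DR2", "P3", 1, INF),    # assumed assignment
    ("DR2", "P4", 1, 3),      # assumed: P4 sold in DR2 until the restructuring
    ("DR2", "P5", 2, 3),      # P5 appears in M2 in DR2
    ("DR3", "P4", 3, INF),    # P4 and P5 move to DR3 in M3
    ("DR3", "P5", 3, INF),
]


def tree_at(edges, month):
    """Return the consolidation tree valid in `month` as a father -> sons mapping."""
    tree = {}
    for father, son, begin, end in edges:
        if begin <= month < end:
            tree.setdefault(father, []).append(son)
    return tree


tree_m2 = tree_at(EDGES, 2)
tree_m3 = tree_at(EDGES, 3)
```

Because the intervals are left-closed and right-open, a relation that ends in month 3 and its successor that begins in month 3 never overlap.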
As the elements of a dimension are used to describe different OLAP cubes, we
cannot give these elements a global time stamp. This is also true for the nodes of
the different trees. But when we provide time stamps only on the edges of a graph,
we have to analyse every consolidation path before an OLAP access. Another
way of storing different structures is tuple time stamping. For the
example, we decompose the tree into time relations (node1; node2; [beginning of
valid time, end of valid time[). In this special case of consolidation trees
we see no difference between attribute and tuple time stamps.

The implementation of either attribute or tuple time stamps should be done
by matrices of valid times, as shown in Fig. 3.

Fig. 3. Matrix of valid time stamps



Each row and each column of the square matrix is labeled by a node
of the consolidation tree. Every existing father-son relation obtains a time stamp
as its entry in the table. In the shaded cells we see the derived valid times of the
leaves and the root of the considered tree. The arrows in Fig. 4 stand for the
union of all contributing time intervals. An inspection of the table gives the
structure of a dimension and the changes of this structure over time.
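
A matrix of valid times and the derived valid time of a node can be sketched in Python as below. The nested-dictionary layout, the function names and the sample relations are assumptions made for illustration; the derived valid time is computed as the union of all intervals in the node's row and column, in the spirit of the shaded cells described above.

```python
INF = float("inf")

# Hypothetical father-son relations with half-open validity intervals.
RELATIONS = [
    ("DR0", "DR2", 1, INF),
    ("DR0", "DR3", 3, INF),
    ("DR2", "P4", 1, 3),
    ("DR2", "P5", 2, 3),
    ("DR3", "P4", 3, INF),
    ("DR3", "P5", 3, INF),
]


def build_matrix(relations):
    """matrix[father][son] holds the list of (begin, end) stamps of that relation."""
    matrix = {}
    for father, son, begin, end in relations:
        matrix.setdefault(father, {}).setdefault(son, []).append((begin, end))
    return matrix


def union(intervals):
    """Merge half-open intervals that overlap or touch into a minimal sorted list."""
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def derived_valid_time(matrix, node):
    """Union of all stamps in which `node` takes part as father or as son."""
    stamps = []
    for father, row in matrix.items():
        for son, entry in row.items():
            if node in (father, son):
                stamps.extend(entry)
    return union(stamps)


matrix = build_matrix(RELATIONS)
```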

For all structural information in the form of consolidation trees we need a matrix
representation. This technique is necessary not only for diverse trees in one
or more dimensions of a particular cube but also for different cubes (neglecting
the hypercube approach). This leads to the idea of multidimensional meta
cubes in which the structure of a complete OLAP system is stored. Besides all used
dimensions except the dimension TIME (time itself is not time valid), a meta
dimension CUBES is introduced so that every single dimension structure in every
cube can be stored with its time stamps in the meta cube.

Any access to a temporal multidimensional database (see Fig. 4) by OLAP
queries must use the information of the meta cube. Following a drill-down into
a cube for an individual time period, each related consolidation tree has to be
taken into account. The advantage of this concept is easy support for all
time-related data analyses. So we can compare historical data (facts) or
current data on the basis of consolidation structures that differ over time. The
option of valid time stamping should be added to classical OLAP architectures to
provide not only time-dependent facts but also access to evolutionary changes of
business structures.

Fig. 4. Consolidation tree and fact table



4 Conclusion 

It has been shown that a data warehouse generally needs temporal components to model
structural data. The special architecture of data warehouses has a strong impact
on the use of temporal extensions in databases. Only a few concepts of temporal
relational databases are transferable. To store time-dependent structural data
we suggest the use of meta cubes in which global time information about every
relation is available. In summary, every data warehouse supports
valid times for facts but seldom for stored business structures or rules. This paper
has shown the extension of conceptual models and implementations of a data
warehouse towards a temporal database.
Abstract. Based on a real-life problem, the target group selection for a bank’s
database marketing campaign, we will examine the capacity of Neuro-Fuzzy 
Systems (NFS) for Data Mining. NFS promise to combine the benefits of both 
fuzzy systems and neural networks, and are thus able to learn IF-THEN-rules, 
which are easy to interpret, from data. However, they often need extensive 
preprocessing efforts, especially concerning the imputation of missing values 
and the selection of relevant attributes and cases. In this paper we will 
demonstrate innovative solutions for various pre- and postprocessing tasks as 
well as the results from the Nefclass Neuro-Fuzzy software package. 

Keywords. Database Marketing, Data Mining, Missing values, Neuro-Fuzzy 
Systems, Preprocessing 



1 Introduction 

Companies doing database marketing experience target group selection as a core
problem. At the same time they are often confronted with a huge amount of data
stored in their databases. These could be a rich source of knowledge, if only
properly used. The new field of research called Knowledge Discovery in Databases
(KDD) aims at closing this gap by developing and integrating Data Mining
algorithms, which are capable of ‘pressing the crude data coal into diamonds of
knowledge’. In this case study we describe how to support a bank’s new direct
mailing campaign based on data about its customers and their reactions to a past
campaign. The database consists of 186,162 cases (656 of them being respondents
and the rest non-respondents) and 43 attributes, e.g. date of birth, sum of transactions
etc., as well as the responding behaviour. We will describe how Neuro-Fuzzy
Systems can be used as Data Mining tools to extract descriptions of interesting target
groups for this bank. We will also show which preprocessing and postprocessing
steps are indispensable to make this Neuro-Fuzzy Data Mining kernel work.



2 Neuro-Fuzzy Systems as Data Mining Tools

Neuro-Fuzzy Systems (NFS) represent a new development in the field of Data 
Mining tools. They promise to combine the benefits of both fuzzy systems and neural 
networks, because the hybrid NFS-architecture can be adapted and also interpreted 
during learning as well as afterwards. There is a wide variety of approaches summed 
up under the term ‘Neuro-Fuzzy Systems’ [1]. Here we focus on a specific approach 
called Nefclass (NEuro Fuzzy CLASSification), which is available as freeware from 
the Institute of Information and Communication Systems at the University of 
Magdeburg, Germany (http://fuzzy.cs.uni-magdeburg.de/welcome.html).
Nefclass is an interactive tool for data analysis and determines the correct class or 
category of a given input pattern. A fuzzy system is mapped on a neural network, a 
feed-forward three-layered Multilayer Perceptron. The (crisp) input pattern is 
presented to the neurons of the first layer. The fuzzification takes place when the 
input signals are propagated to the hidden layer, because the weights of the 
connections are modelled as membership functions of linguistic terms. The neurons 
of the hidden layer represent the rules. They are fully connected to the input layer 
with connection weights being interpretable as fuzzy sets. A hidden neuron’s 
response is connected to one output neuron only. With output neurons being 
associated with an output class on a 1:1 basis, each hidden neuron serves as a 
classification detector for exactly one output class. Connections from the hidden to the
output layer do not carry connection weights, reflecting the lack of rule confidences in the
approach. The learning phase is divided into two steps. In the first step, the system
learns the number of rule units and their connections, i.e. the rules and their 
antecedents, and in the second step, it learns the optimal fuzzy sets, that is their 
membership functions [2]. 
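
The three-layer structure described above can be sketched in Python. This is not the Nefclass implementation: the triangular membership functions, the linguistic terms, the two rules, the use of the minimum to combine antecedents and the maximum to combine rules of the same class are all assumptions chosen for illustration, and every name is hypothetical.

```python
def triangular(a, b, c):
    """Membership function that rises from a to a peak at b and falls to c (a < b < c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


# Input-to-hidden connections: fuzzy sets of linguistic terms (hypothetical values).
FUZZY_SETS = {
    "age": {"young": triangular(0, 20, 45), "old": triangular(35, 70, 110)},
    "income": {"low": triangular(-1, 0, 50), "high": triangular(30, 100, 200)},
}

# Hidden layer: one rule unit per IF-THEN rule, each tied to exactly one class.
RULES = [
    ({"age": "young", "income": "high"}, "respondent"),
    ({"age": "old", "income": "low"}, "non-respondent"),
]


def classify(pattern):
    """Propagate a crisp input pattern and return (winning class, class activations)."""
    activation = {}
    for antecedent, cls in RULES:
        # Rule activation: minimum of the membership degrees (assumed t-norm).
        degree = min(FUZZY_SETS[var][term](pattern[var]) for var, term in antecedent.items())
        # Unweighted hidden-to-output connection: keep the strongest rule per class.
        activation[cls] = max(activation.get(cls, 0.0), degree)
    return max(activation, key=activation.get), activation


winner, activations = classify({"age": 25, "income": 80})
```

Because every rule unit maps to a single class and the connection carries no weight, the fired rules can be read back directly as IF-THEN statements.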

In recent years various authors have stressed the benefits of using NFS for Data
Mining [2]. The effectiveness of these algorithms has been shown empirically for small files
like the iris data (see e.g. [3] or [4]). But when it comes to analyzing real-life
problems, their main advantages, ease of understanding and the capability to
process large databases automatically, remain a much-acclaimed desire rather than a
proven fact. Regarding classification quality, in a previous study based on a
real-life data file and criteria [5], we found Neuro-Fuzzy Systems to be as good as
the other algorithms tested. Their advantages are ease of interpretation (fuzzy IF-THEN
rules) and their ability to easily integrate a priori knowledge to enhance performance
and/or classification quality. However, two severe problems may jeopardize Neuro-Fuzzy
Data Mining, especially for large databases:

- Inability to handle missing values, which is especially true for the NEFCLASS
product considered. All cases with missing values have to be excluded. In datasets
where missing values are common, this brute-force approach bears a high risk of
deleting relevant learning patterns altogether.

- Run-time behavior can be most annoying when working on large databases. Moreover,
the interpretability of the resulting rule set decreases with many input
variables and thus many rules, because many NFS, e.g. Nefclass, are not able to
pick the relevant attributes.

It has turned out that pre- and postprocessing efforts are indispensable for Neuro-Fuzzy
Systems (as for most Data Mining algorithms). The Knowledge Discovery in
Databases paradigm as described by [6] offers a conceptual framework for the design
of a knowledge extracting system that consists of preprocessing steps (data selection;
imputation of missing values; dimensionality reduction), a Data Mining kernel and
postprocessing measures. For each step we have to develop tools that solve the
problems mentioned in an integrated fashion, which means that they should guarantee
an uninterrupted data flow and allow the use of experts’ background knowledge.

Next, we will show how we have implemented the different units of the KDD
process by resorting to existing approaches and developing our own solutions. We will
demonstrate the efficiency of this system by a target group selection case study.



3 Data Selection 

To promote a new product, one of Germany's leading retail banks had conducted a 
locally confined but otherwise large mailing campaign. To efficiently extend this 
action to the whole of the country, a forecast of reaction probability based on
demographic and customer history data is required. Eliminating data items that are 
low on reliability and/or widespread and consistent availability, we end up with 28 
independent variables. Next, cases with extreme value constellations were identified 
as outliers using standard descriptive procedures and eliminated. 



4 Imputation of Missing Values 

Missing values often occur in real-life databases and can cause severe problems for
Data Mining. This applies to Data Warehouses in particular, where data are collected
from many different, often heterogeneous sources. Different solutions have been
developed in the past to solve this problem. First, one can delete all cases with at least
one missing value (listwise deletion). But this leads to a great loss of information [7].
Secondly, one can use different methods of parameter estimation, especially by means
of maximum likelihood functions [8]. These algorithms are very time-consuming
because of exhaustive estimations and thus run counter to the Data Mining
philosophy. The most promising approach, however, is to explicitly impute the missing
data. Here, we can distinguish two main directions according to the way the
redundancies in the database are used:

- Approaches that try to impute the missing values by setting up relationships
between the attributes. These relationships (regressions or simple ratios), computed
over the complete cases, and the known values of an incomplete case are used to
calculate an appropriate imputation value. The most famous algorithm here is the
regression-based imputation by [9].

- Algorithms that use the corresponding value of the most similar complete case for
imputation. As this nearest-neighbour approach proves to be computationally
inefficient, one often clusters cases into groups and uses cluster centers in place of
cases to carry out similarity and distance calculations.
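
The second direction, imputation from the most similar complete case, can be sketched in Python. The sketch assumes numeric attributes and Euclidean distance over the known attributes; the function name and the sample data are hypothetical.

```python
import math


def impute_nearest(case, donors):
    """Fill the None entries of `case` with the values of the most similar donor.

    `donors` are complete cases; passing cluster centers instead gives the
    clustered variant mentioned in the text.
    """
    known = [i for i, value in enumerate(case) if value is not None]

    def distance(donor):
        # Euclidean distance restricted to the attributes known for `case`.
        return math.sqrt(sum((case[i] - donor[i]) ** 2 for i in known))

    nearest = min(donors, key=distance)
    return [nearest[i] if value is None else value for i, value in enumerate(case)]


# Hypothetical complete cases and one incomplete case.
complete_cases = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
imputed = impute_nearest([9.0, None, 28.0], complete_cases)
```

Every imputation scans all donors, which is why replacing the complete cases by a much smaller set of cluster centers reduces the cost.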

Although this last approach seems to be very attractive, it is still very time-consuming
in the clustering step. Another way, which we have tried, is to use the information of a
decision tree for imputation. This decision tree is built by the C5.0 algorithm
[10], which is far more efficient than any clustering algorithm. The basic idea is as
follows: C5.0 analyses a discrete dependent variable and builds a decision tree from
the influence set, using the variables with the strongest discriminating power for the top
splits. In the leaves of the tree the number of cases belonging to each class may be
recorded. These numbers should ideally be close to 0 or 100%. For an example see
Figure 1 with 22 cases described by the two attributes found relevant, i.e. profession and age
group. We interpret the leaves of the tree as homogeneous groups of cases. To put
widespread experience in a nutshell, this method can model data dependencies
efficiently and robustly. When building the tree the algorithm can also utilize cases with missing
values by making assumptions about the distribution of the variable affected, which leads
to the fractional numbers in the leaves, as shown in Figure 1 [10].




Fig. 1. C5.0 decision-tree-based imputation. 



To use this tree for imputation we first have to compute a quality score for each 
eligible imputation value (or imputation value constellation in case more than one 
attribute is missing), based on the similarity between the case with the 
missing data and each path down the tree to a leaf. This is done by an approximate 
nearest neighbour (ANN) approach: the ANN method computes similarity as the 
weighted sum of the distances at every node (top nodes receive a higher weight), 
including a proxy node for the class variable to generate absolutely homogeneous end 
leaves. Finally, this score is also multiplied by a "significance value". 

Secondly, we have to identify the best imputation value (constellation) from the 
quality scores of all possible imputation values. To visualize this problem one can 
draw an (n+1)-dimensional plot for a case with n missing values (see Figure 2 a), 
displaying our score as a function of value constellations). To find a suitable 
imputation value set, different methods are possible, which we have borrowed from 
the defuzzification strategies in fuzzy set theory. 

Center of Gravity approach: Here, the imputation value(s) correspond to the 
attributes' values for the center of the quality plot (in Figure 2 a) this is (2,3) for 
attributes 1 and 2 respectively). But this approach does not seem suitable for reasons 
already discussed in fuzzy set theory [11]. 

Maximum approach: This method uses the constellation with the maximum quality 
score for imputation. This will not always yield unique results, because the decision 
tree does not have to be developed completely. Hence, some variables might not be 
defined. In Figure 2 a), for example, attribute 2 will not enter the tree if attribute 1 
equals 3. There are several solutions. We can use the best constellation that is 
completely defined (Single Max; (1,4) in Figure 2 a)) or do an unrestricted search for 
the maximum and, in case this constellation is not unique, use a surrogate procedure 
on the not-yet-defined attribute(s). The surrogate value can be the attribute's global 
mean (Global Max Mean; leading to (3,2) for the example in Figure 2 a)) or the value 
with the highest average quality, the average being computed over the defined 
variables (Global Max Dice; (3,4) in Figure 2 a)). Applied to the case of Figure 2 a), 
this means averaging over variable 1 to get the (marginal) quality distribution displayed 
in Figure 2 b). Based on this "plot 11" we calculate the surrogate value, shown in 
Figure 2 c), as follows: 

Fig. 2. Quality plot and defuzzification by Global Max Dice. 

In essence, this amounts to using the marginal score distribution of variable 2 to 
modulate branches where the decision tree is not fully developed. The exact definition 
of the surrogate value is not essential, because the fact that the decision tree algorithm 
did not take that attribute into consideration shows its small discriminating power. 
But the maximum approach has a distinctive benefit: a simple extension of the method 
allows some kind of multiple imputation [12] without really imputing different values 
for one missing datum. For each missing value constellation, the n best 
value tuples are stored together with their quality scores (n can be chosen freely; n=1 
corresponds to single imputation), and one of them is chosen randomly according to its 
quality score. In doing so we avoid the often criticized reduction of variance inherent 
in ordinary, deterministic imputation methods [12]. In SPSS, for example, this 
problem is solved by artificially adding variance-enlarging noise, whereas the method 
explained above seems much more elegant and appropriate. 
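
The random choice among the n best value tuples can be sketched as follows. This is only an illustration, not the authors' implementation: it assumes that "randomly according to its quality score" means a draw with probability proportional to the score, and the function name `draw_imputation` and the example scores are hypothetical.

```python
import random

def draw_imputation(candidates, n=3, rng=None):
    """Pick one of the n best value tuples, with probability proportional to its score.

    candidates: list of (value_tuple, quality_score) pairs with scores >= 0.
    n = 1 reduces to single (deterministic) imputation with the best tuple.
    """
    rng = rng or random.Random()
    # Keep the n best-scoring value tuples.
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    values = [v for v, _ in best]
    scores = [s for _, s in best]
    if sum(scores) <= 0:
        # Degenerate case (an assumption of this sketch): fall back to the top tuple.
        return values[0]
    # random.choices draws with probability proportional to the given weights.
    return rng.choices(values, weights=scores, k=1)[0]

# Hypothetical quality scores for value constellations of two attributes.
scores = [((3, 4), 0.6), ((1, 4), 0.3), ((2, 3), 0.1), ((2, 1), 0.05)]
single = draw_imputation(scores, n=1)                          # best tuple only
multiple = draw_imputation(scores, n=3, rng=random.Random(0))  # one of the 3 best
```

Repeating the draw for the same incomplete case yields different imputed values across runs, which is what preserves variance compared with always imputing the single best tuple.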

In our case we had 4 attributes with missing data in 0.3%, 27.5%, 50.7% and 86.4% 
of all cases. Therefore, we would have lost 95% of all cases when using the listwise 
deletion technique. This would lead to a very high loss of information, especially 
because all but nine (=1.4%) of all respondents would have to be dropped. The 
traditional imputation algorithms build their models only on the complete cases. But 
only 5% of all cases are complete; thus we can hardly trust the regression parameters 
or clusters. The C5.0 algorithm also uses cases with missing data, although this is 
done in a rather brute-force way. This is why we have chosen this method. We have 
used Global Max Dice for defuzzification with n=3, i.e. multiple imputation. 



5 Dimensionality Reduction 

Neuro-Fuzzy Systems like Nefclass, which are not able to identify the most relevant 
attributes for the rules, suffer from a combinatorial explosion in both run time and in 
breadth and depth of the rule base. Therefore a selection of the relevant attributes is 
inevitable [13]. For this purpose, we have different methods at hand. On the one 
hand, by the search algorithm employed, we can distinguish heuristic (e.g. forward or 
backward selection [13], Relief algorithm [14]), complete (e.g. FOCUS [15]) and 
stochastic methods (e.g. by genetic algorithms [16]) [17]. On the other hand, different 
approaches use different evaluation functions. In the wrapper approaches each subset 
of attributes is evaluated based on the classification quality of the target Data Mining 
algorithm [18]. One common feature of these approaches is that they are very time 
consuming and thus not appropriate for Data Mining, whose idea is to process large 
amounts of data in a short time. Another method is the filter approach. Here, the 
quality of the selected attributes is evaluated by some heuristic different 
from the target Data Mining algorithm, with the hope of capturing discriminating 
quality in a function that is much easier to evaluate. Examples are the Relief 
algorithm or FOCUS [15]. Their shortcoming is that the evaluation function's bias 
may differ from the classifier's bias. To sum up, we face an efficiency-effectiveness 
dilemma: none of the single solutions proves to be optimal. 

To overcome this dilemma, we have developed a model combining different 
approaches in a stepwise procedure [19]. In each step we eliminate attributes with 
methods that are more sophisticated, but also less efficient, than the method in the 
predecessor step. The choice of methods in each step depends to a large extent on the 
data situation. In our first prototype application we have used the following methods to 
study the feasibility of our general approach. The first step follows the filter idea. In 
three substeps we have eliminated attributes by different methods. 

- First of all, to eliminate irrelevant variables, we have looked at the variables' 
entropies and the interrelationship of each independent attribute and the output 
variable. The entropy as a measure for the information conveyed by a discrete-valued 
attribute is (in the case of binary data) computed according to the well-known formula: 

\mathrm{entropy}_i = -\left( p_{i0} \cdot \mathrm{ld}(p_{i0}) + p_{i1} \cdot \mathrm{ld}(p_{i1}) \right) . (2) 

p_{i0}, p_{i1} = probability of 0 / 1 in variable i; ld = logarithm to base 2. 

Based on the entropy we have computed a so-called e-value as e = 1 - entropy. The 
idea is to eliminate attributes with a high e-value, i.e. attributes with low 
information. Rules based on these variables might be correct, but their support is 
too small. In addition we performed a χ²-test for independence of the output 
variable from each potential influence (considered in isolation). An attribute's 
so-called u-value is computed as the 2-sided p-value for this test. The idea is to 
rule out attributes which have a u-value higher than a specific significance level, 
i.e. which have no significant correlation with the output variable. However, this 
does not claim statistical rigour, but serves as a heuristic clue. 
Moreover, as univariate analyses will not take suppression and related interaction 
effects into account, this procedure is heuristic rather than rigorous. Based on both 
values we have calculated a so-called del-value as a measure of how reliably a 
variable can be deleted. It allows for compensation between u- and e-value 
by defining del = u*e and using a threshold for this (here 0.19=0.02*0.95). By doing 
so we have managed to delete 6 irrelevant attributes. 

- In the second substep, we have clustered the remaining attributes via a complete 
linkage clustering (with correlation serving as a proximity measure) and eliminated 
3 redundant variables. Doing a factor analysis instead will result in even lower 
information loss at the price of interpretability. 

- In the third substep, we have used a C5.0 decision tree to identify the attributes 
with the highest information power. They are located in the first layers of the tree. 
Here, we have used the first five layers and could eliminate 6 variables that do not 
appear in these layers and are thus just weakly relevant. Again this is a very 
heuristic but very fast procedure. 
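
The e-, u- and del-values of the first substep can be sketched for binary attributes as below. This is a hedged illustration only: the function names are hypothetical, the χ²-test is computed without continuity correction (the text does not say which variant was used), and the deletion threshold is left to the caller.

```python
import math

def entropy_binary(values):
    """Entropy (base 2) of a 0/1 attribute, following formula (2)."""
    p1 = sum(values) / len(values)
    p0 = 1.0 - p1
    # By convention 0 * ld(0) = 0, so zero probabilities are skipped.
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

def chi2_p_value_2x2(x, y):
    """p-value of the chi-square independence test for two 0/1 variables.

    Assumption of this sketch: no continuity correction. With 1 degree of
    freedom the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    """
    n = len(x)
    counts = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        counts[a][b] += 1
    rows = [sum(r) for r in counts]
    cols = [counts[0][j] + counts[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected == 0:
                return 1.0  # a constant variable gives no evidence of dependence
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2.0))

def del_value(attribute, target):
    """del = u * e: high values suggest the attribute can be deleted."""
    e = 1.0 - entropy_binary(attribute)       # high e means little information
    u = chi2_p_value_2x2(attribute, target)   # high u means no significant association
    return u * e

target = [0, 1] * 10
informative = list(target)   # perfectly associated, balanced attribute
constant = [0] * 20          # carries no information at all
```

Here `del_value(informative, target)` is 0 (keep the attribute) and `del_value(constant, target)` is 1 (a safe candidate for deletion); attributes in between would be compared against the chosen threshold.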

Neuro-Fuzzy methods may now start. The second data compression step consists 
of a backward elimination with wrapper evaluation. We have eliminated attributes in 
a stepwise way until the classification quality of the target Neuro-Fuzzy System 
derived has decreased significantly. This is, of course, a very time consuming 
method. But it is inevitable for the final selection of the attributes. Due to the 
preceding heuristic selection and thinning-out steps, this wrapper approach has 
become feasible at this stage. 
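
A minimal sketch of backward elimination with wrapper evaluation is given below. It is not the authors' code: `evaluate` stands for training and validating the target classifier on an attribute subset, and `tolerance` is a hypothetical stand-in for "decreased significantly", measured here against the quality of the full attribute set.

```python
def backward_elimination(attributes, evaluate, tolerance=0.02):
    """Greedy backward elimination with wrapper evaluation.

    attributes: list of attribute names.
    evaluate:   callable mapping a list of attributes to a quality score
                (e.g. validation accuracy of the target classifier).
    tolerance:  largest accepted quality loss relative to the full set.
    """
    current = list(attributes)
    baseline = evaluate(current)
    while len(current) > 1:
        # Try removing each attribute and keep the least harmful removal.
        trials = [(evaluate([a for a in current if a != cand]), cand)
                  for cand in current]
        best_quality, victim = max(trials)
        if baseline - best_quality > tolerance:
            break  # every further removal costs too much quality
        current.remove(victim)
    return current

# Toy evaluation function: only attributes "a" and "b" carry information.
relevant = {"a", "b"}
toy_quality = lambda attrs: 0.5 + 0.1 * len(relevant & set(attrs))
selected = backward_elimination(["a", "b", "c", "d"], toy_quality)
```

Each pass calls `evaluate` once per remaining attribute, which is why the cheaper filter substeps are run first to shrink the candidate set.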

To reduce the run time we not only have to select the relevant attributes but also 
the relevant cases. A small sample should be drawn containing all prototypes 
necessary to cover the input space. Previous studies have also shown that Nefclass 
needs a sample that is balanced with respect to classes. Therefore we have drawn a 
training sample by pure random sampling consisting of about 300 respondents and 300 
non-respondents. Unfortunately, this file was very small due to the small number of 
respondents. This might lead to a high degree of randomness in the sample quality. 
But sensitivity analyses have shown that this was not the case. We have also tried 
more intelligent sampling methods, such as stratified sampling or jack-knifing. But 
this did not lead to better results. 
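
Drawing a class-balanced random training sample can be sketched as follows. The function and parameter names are hypothetical, and the toy data merely mimics a situation with few respondents and many non-respondents.

```python
import random

def balanced_sample(cases, label_of, per_class=300, seed=0):
    """Draw up to per_class cases at random (without replacement) from each class."""
    rng = random.Random(seed)
    by_class = {}
    for case in cases:
        by_class.setdefault(label_of(case), []).append(case)
    sample = []
    for members in by_class.values():
        k = min(per_class, len(members))   # a small class limits its own share
        sample.extend(rng.sample(members, k))
    rng.shuffle(sample)
    return sample

# Toy data: (case id, class label) pairs with a rare "respondent" class.
cases = [(i, "respondent") for i in range(50)] + \
        [(i, "non-respondent") for i in range(50, 1050)]
train = balanced_sample(cases, label_of=lambda c: c[1], per_class=30)
```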



6 Data Mining with NEFCLASS 

Nefclass learns the number of rule units and their connections, i.e. the rules. If the 
user restricts the maximum number of rules, the best rules are selected according to 
an optimisation criterion [1]. This maximum rule number and the number of fuzzy sets 
per input variable are the most important user defined parameters. The optimal number 
of rules heavily depends on the attribute subset used, which obviously complicates 
the wrapper step. Therefore, we have optimized this parameter after every fifth 
iteration of the variable backward elimination process. The classification quality 
depends on the number of rules in an almost step-like fashion. Thus it makes sense to 
redefine the number of rules in a rather heuristic approach at the left edge of a step. 

The number of fuzzy sets per input variable was generally set to two. For the 
binary variables this is an obvious choice, and for continuous attributes we also 
identified two fuzzy sets as being sufficient. Using more fuzzy sets will enhance 
classification quality just marginally, but exponentially increases the number of rules. 
After having identified the optimal parameters and the optimal attribute subset, we 
have finally trained NEFCLASS, coming up with a rule base with 4 influencing 
attributes and 10 rules. The classification quality was quite good with 72.7% of 
correctly classified cases in the independent validation set (respondents: 68.7%; 
non-respondents: 77.7%). The rules (for the sake of simplification in a matrix 
notation) and the membership functions are shown in Figure 3 a). 



7 Rule Postprocessing 

For all our efforts in attribute reduction, the resulting rule base still leaves room for 
improvement. In Figure 3, for example, we have aggregated the rules based on a C5.0 
decision tree. One can identify relevant variables, as decision trees locate them at the 
root and the first nodes of the tree. The rules are sorted by these attributes. In such a 
sorted tableau one can easily identify and eliminate redundant variables and rules 
(e.g. if the total number of transactions is large then the customer is a respondent, 
regardless of the values of the other three attributes). There is, of course, no guarantee 
that the classification quality will not deteriorate significantly when the pruned rule 
base is applied to data from outside the calibration sample. In close analogy to 
overfitting problems in neural networks, checking against validation samples is highly 
recommended. 

In this study we have aggregated the rule base manually, but a tool is being 
developed, that will not only aggregate rules automatically, but also represent them to 
the user in an adequate and interactive way. 



Original rule base: 

   Cheqctq   Ichan   Itax    Ttlntrns        Class 
   large     small   small   large           respondent 
   small     large   small   large           respondent 
   small     small   small   large           respondent 
   large     small   small   small           respondent 
   large     large   small   small           respondent 
   large     large   large   small           respondent 
   small     large   small   small           non-resp. 
   small     small   small   small           non-resp. 
   small     small   large   small           non-resp. 
   large     small   large   small           non-resp. 

Aggregated rule base (blank cell: attribute not used by the rule): 

   Ttlntrns  Cheqctq   Ichan   Itax            Class 
   large                                 -->   respondent 
   small     large     large             -->   respondent 
   small     large     small   small     -->   respondent 
   small     large     small   large     -->   non-resp. 
   small     small                       -->   non-resp. 

Cheqctq: customer's quality score; Ichan: investments with high risks; 
Itax: tax-oriented investments; Ttlntrns: customer's total number of transactions 




Fig. 3. Original and aggregated rule base including membership functions. 
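
As a plausibility check, the aggregated rule base can be written as nested conditions and compared with the ten original rules. The sketch below treats the linguistic terms as crisp labels (it ignores the fuzzy membership functions), and the rule entries are transcribed from Fig. 3 as far as the figure can be read, so they should be regarded as an assumption.

```python
# Original rule base: (Cheqctq, Ichan, Itax, Ttlntrns) -> class.
ORIGINAL_RULES = {
    ("large", "small", "small", "large"): "respondent",
    ("small", "large", "small", "large"): "respondent",
    ("small", "small", "small", "large"): "respondent",
    ("large", "small", "small", "small"): "respondent",
    ("large", "large", "small", "small"): "respondent",
    ("large", "large", "large", "small"): "respondent",
    ("small", "large", "small", "small"): "non-resp.",
    ("small", "small", "small", "small"): "non-resp.",
    ("small", "small", "large", "small"): "non-resp.",
    ("large", "small", "large", "small"): "non-resp.",
}

def aggregated_class(cheqctq, ichan, itax, ttlntrns):
    """Aggregated rule base, sorted by Ttlntrns, Cheqctq, Ichan, Itax."""
    if ttlntrns == "large":
        return "respondent"      # the other three attributes are irrelevant
    if cheqctq == "small":
        return "non-resp."
    # From here on: Ttlntrns small and Cheqctq large.
    if ichan == "large":
        return "respondent"
    return "respondent" if itax == "small" else "non-resp."

# Original rules that the aggregated rule base would classify differently.
mismatches = [rule for rule, cls in ORIGINAL_RULES.items()
              if aggregated_class(*rule) != cls]
```

An empty `mismatches` list means the five aggregated rules reproduce all ten original rules on crisp inputs; it says nothing about behaviour on data outside the calibration sample.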



8 Conclusion 

This study has shown the great prospects of Neuro-Fuzzy Data Mining. The results 
affirmed the previous experience that Neuro-Fuzzy Systems are not able to 
outperform alternative approaches, e.g. neural nets and discriminant analysis, with 
respect to classification quality. But they provide a rule base that is very compact and 
easy to understand. Extensive preprocessing activities have been necessary, 
especially concerning the imputation of missing values and selection of relevant 
attributes and cases. Besides, we believe that these tools are also of general relevance. 
Intelligent postprocessing can further enhance the resulting rule base’s power of 
expression. In this paper, we have shown some promising approaches for these steps 
as well as their effectiveness and efficiency. Of course, we are still far away from an 
integrated data flow. But our experience with the single modules described above is 
very promising. However, these first results have to be further validated for different 
Neuro-Fuzzy Systems and different data situations. Our final goal is to integrate these 
modular solutions into a comprehensive KDD tool box. 

Funding for this project was provided by the Thuringian Ministry for Science, 
Research and Culture. Responsibility for content is entirely with the authors. 



Abstract. Planning adequate audit strategies is a key success factor in "a 
posteriori" fraud detection, e.g., in the fiscal and insurance domains, where 
audits are intended to detect tax evasion and fraudulent claims. A case study is 
presented in this paper, which illustrates how techniques based on 
classification can be used to support the task of planning audit strategies. 
The proposed approach is sensitive to some conflicting issues of audit 
planning, e.g., the trade-off between maximizing audit benefits vs. 
minimizing audit costs. 



1 Introduction 

Fraud detection is becoming a central application area for knowledge discovery in 
databases, as it poses challenging technical and methodological problems, many of 
which are still open [1, 2]. A major task in fraud detection is that of constructing 
models, or profiles, of fraudulent behavior, which may serve in decision support 
systems for: 

• preventing frauds (a priori fraud detection), or 

• planning audit strategies (a posteriori fraud detection). 

The first case is typical of domains such as credit cards and mobile telephony [3, 4]. 
The second case concerns a whole class of applications, namely whenever we are faced 
with the problem of constructing models by analyzing historical audit data, for the 
purpose of effectively planning future audits. This is the case, e.g., in the fiscal and 
insurance domains, where an adequately targeted audit strategy is a key success factor 
for governments and insurance companies. In fact, huge amounts of resources may in 
principle be recovered from well-targeted audits: for instance, the form of tax evasion 
consisting in filing fraudulent tax declarations in Italy is estimated at between 3% and 
10% of GNP [5]. This explains the increasing interest and investments of 
governments and insurance companies in intelligent systems for audit planning. 

This short paper presents a case study for planning audits in the fiscal fraud 
detection domain. Audit planning is usually a difficult task, in that it has to take into 
account constraints on the available human and financial resources to carry out the 
audits themselves. This complexity is present in our case study, too. Therefore, the 
proposed planning has to face two conflicting issues: 

• maximizing audit benefits, i.e., define subjects to be selected for audit in such a 
way that the recovery of evaded tax is maximized, and 

• minimizing audit costs, i.e., define subjects to be selected for audit in such a way 
that the resources needed to carry out the audits are minimized. 

The case study has been developed within a project aimed at investigating the adequacy 
and sustainability of KDD in the detection of tax evasion. 



2 Planning Audit Strategies in the Fiscal Domain 

In this section, a methodology for constructing profiles of fraudulent behavior is 
presented, aimed at supporting audit planning. The reference paradigm is that of the 
KDD process, in the version of direct knowledge extraction [6]. The reference 
technique is that of classification, using decision trees [7]. 



2.1 Identification of Available Data Sources 

The dataset used in the case study consists of information from tax declarations, 
integrated with data from other sources, such as social benefits paid by taxpayers to 
employees, official budget documents, and electricity and telephone bills. Each tuple 
in the dataset corresponds to a (large or medium) company that filed a tax declaration 
in a certain period of time: we shall use the word subject to refer to such companies. 
The initial dataset consists of 80643 tuples, with 175 attributes (or features), most of 
them numeric and only a few categorical. Of this dataset, 4103 tuples correspond to 
audited subjects: the outcome of the audit is recorded in a separate dataset with 4103 
tuples and 7 attributes, one of which represents the amount of evaded tax ascertained 
by the audit. This feature is named recovery. The recovery attribute has value zero if 
no fraud is detected. 



2.2 Identification of a Cost Model 

Together with domain experts, a cost model has been defined, to be included in the 
predictive model. In fact, audits are very expensive in both human and financial 
resources, and therefore it is important to focus audits on subjects that presumably 
return a high recovery. The challenging goal is therefore to build a classifier, which 
selects those interesting subjects. 

The cost model in our case study has been developed as follows. First, a new attribute 
audit_cost is defined as a derived attribute, i.e., a function of other attributes. 
audit_cost represents an estimate, provided by the domain expert, of the cost of an 
audit, which grows with the square of the sum of the number of employees and the sales 
volume of the subject to be audited. Next, we define another derived attribute 
actual_recovery, as the recovery of an audit net of the audit cost. Therefore, for each 
tuple i, we define: 

actual_recovery(i) = recovery(i) - audit_cost(i) 

The attribute actual_recovery is used to discriminate between subjects with a positive 
or negative value for such attribute. The key point is in using the cost model within 
the learning process itself, and not only in the evaluation phase. 

The target variable of our analysis is constructed from actual_recovery, by defining 
the class of actual recovery - car in short - in such a way that, for each tuple i: 

car(i) = negative   if actual_recovery(i) <= 0 
         positive   if actual_recovery(i) > 0 

The goal is a predictive model able to characterize the positive subjects, which are 
eligible to be audited. 
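
The derivation of the target variable can be sketched in a few lines. The audit cost is taken here as a given input, because the exact expert-provided cost function is not stated in the text; the function names mirror the attribute names above and the numbers are made up.

```python
def actual_recovery(recovery, audit_cost):
    """Recovery of an audit net of its cost."""
    return recovery - audit_cost

def car(recovery, audit_cost):
    """Class of actual recovery: positive only if the audit more than pays for itself."""
    return "positive" if actual_recovery(recovery, audit_cost) > 0 else "negative"

# Hypothetical audited subjects: (recovery, audit_cost) in arbitrary money units.
subjects = [(10.0, 3.0), (0.0, 3.0), (3.0, 3.0)]
classes = [car(r, c) for r, c in subjects]
```

Note that a subject with evaded tax but a recovery that does not exceed the audit cost is labeled negative, which is how the cost model enters the learning process itself.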



2.3 Preparation of Data for Analysis 

Data transformation. This phase was extremely time consuming, due to the 
presence of legacy systems, huge operational databases (hierarchical and relational), 
inconsistent measure units and data scales. 

Data cleaning (row removal). Noisy tuples, i.e., those with excessively 
deviating attribute values, have been removed, as well as those tuples with too many 
null attribute values. After data cleaning, 3880 of the initially audited subjects 
remained: 3183 tuples in negative car (82%), and 697 in positive car (18%). 

Attribute selection (column removal). The selection of relevant attributes is a 
crucial step, which was taken together with domain experts. The available 175 
attributes were reduced to 20, by removing irrelevant or derived ones. 

Choice of training-set and test-set. The correct size of the training set is an 
important parameter in a classification experiment. As the size of the training set 
increases, the complexity of the induced model also increases, and the training error 
(i.e., the misclassification rate on training-set tuples) decreases. This does not imply 
that large training-sets are necessarily better: a complex model, with a low training 
error, may behave poorly on new tuples. This phenomenon is named overfitting: the 
classifier is excessively specialized on the training tuples, and has a high 
misclassification rate on new (test) tuples. The classical remedies to overfitting 
include downsizing the training-set and increasing the pruning level. 

Our case study adopts an incremental samples approach to sizing the training-set 
[8], consisting in training a sequence of classifiers over increasingly larger, randomly 
generated subsets of the dataset - 10%, 20%, 33%, 50%, 66%, 90% of the total 
dataset. We discovered that the resulting classifiers improve with increasing 
training-sets, independently of the pruning level. In other words, and not surprisingly, 
there is no risk of overfitting, since the size of the dataset is relatively small with 
respect to the complexity of the knowledge to be extracted. 
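
The generation of the increasingly larger random subsets can be sketched as below. This is an assumption-laden illustration: the text does not say whether the subsets are nested or drawn independently (independent reshuffles are used here), and the function name is hypothetical. Training and scoring a classifier on each subset is left to the caller.

```python
import random

FRACTIONS = (0.10, 0.20, 0.33, 0.50, 0.66, 0.90)

def incremental_samples(dataset, fractions=FRACTIONS, seed=0):
    """Yield (fraction, training_subset, remaining_tuples) for growing random subsets."""
    rng = random.Random(seed)
    for fraction in fractions:
        shuffled = list(dataset)
        rng.shuffle(shuffled)                  # independent random draw per size
        cut = round(fraction * len(shuffled))
        yield fraction, shuffled[:cut], shuffled[cut:]

# Subset sizes for a dataset of 3880 tuples (tuple ids stand in for real records).
sizes = [(fraction, len(train), len(rest))
         for fraction, train, rest in incremental_samples(range(3880))]
```

Plotting the test error of the classifier trained on each subset against the subset size then shows whether the error still falls as the training-set grows.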

As a consequence, the 3880 tuples in the dataset were partitioned as follows: 

- training set: 3514 tuples 
- test set: 366 tuples. 



2.4 Model Construction 

Recall that our goal is a binary classifier with the attribute car (class of actual 
recovery) as target variable. The decision trees are trained to distinguish between 
positive car (fruitful audits) and negative car (unfruitful audits). Once the training 
phase is over, the test-set is fed to the classifier, to check whether it is effective in 
selecting fruitful audits among the new tuples. In our case, what matters is not only the 
misclassification rate of the classifier on the test-set, but also the actual_recovery 
(= ascertained evaded tax - audit cost) obtained from the audits of the subjects from the 
test-set which are classified as positive. This value can be matched against the real 
case, where all (366) tuples of the test-set are audited. This case, which we call Real, 
is characterized by the following: 

• audit#(Real) = #(test-set) = 366 

• actual_recovery(Real) = \sum_{i \in \text{test-set}} actual_recovery(i) = 159.6 M Euro 

• audit_costs(Real) = \sum_{i \in \text{test-set}} audit_cost(i) = 24.9 M Euro 

where recovery and costs are expressed in million euros. As the test-set consists of 
audited subjects, by comparing the values of the Real case with those of the subjects 
classified as positive by the various classifiers, it is possible to evaluate the potential 
improvement of using data mining techniques for the purpose of planning the audits. 

Therefore, the classifiers resulting from our experiments are evaluated according to 
the following metrics, which represent domain-independent (1 and 2) and domain- 
dependent (3 through 6) indicators of the quality of a classifier X: 

1. confusion_matrix(X) 
2. misclassification_rate(X) 
3. actual_recovery(X) 
4. audit_costs(X) 
5. profitability(X) 
6. relevance(X) 

where: 

1. The confusion matrix, which summarizes the prediction of classifier X over the 
test-set tuples, is a table of the form: 



                    classified negative    classified positive 
really negative     #TN                    #FP 
really positive     #FN                    #TP 



where the sets TN, TP, FN, FP are defined as follows, using the notation pred_X(i) to 
denote the car (either positive or negative) of a tuple i predicted by classifier X: 

• TN = { i | pred_X(i) = car(i) = negative } 

is the set of tuples with negative class of actual recovery which are classified as 
such by classifier X (true negative subjects); 

• FP = { i | pred_X(i) = positive AND car(i) = negative } 

is the set of tuples with negative class of actual recovery which are misclassified as 
positive by classifier X (false positive subjects); these are non-fraudulent subjects 
which will be audited, according to X, with a negative actual recovery; 

• FN = { i | pred_X(i) = negative AND car(i) = positive } 

is the set of tuples with positive class of actual recovery which are misclassified as 
negative by classifier X (false negative subjects); these are fraudulent subjects 
which will not be audited, according to X, although the audit would have a positive 
actual recovery (a loss for missing a fruitful audit); 

• TP = { i | pred_X(i) = car(i) = positive } 

is the set of tuples with positive class of actual recovery which are classified as 
such by classifier X (true positive subjects). 

2. The misclassification rate of X is the percentage of misclassified test-set tuples. 
More precisely, it is the ratio between the cardinality of (FP UNION FN) and the 
cardinality of the test-set: 

• misclassification_rate(X)= #(FP UNION FN) * 100 / #( test-set) 

3. The actual recovery of X is the total amount of actual recovery for all tuples 
classified as positive by X: 

• actual_recovery(X) = \sum_{i \in P} actual_recovery(i), where P = TP UNION FP 

4. The audit costs of X is the total amount of audit costs for all tuples classified as 
positive by X: 

• audit_costs(X) = \sum_{i \in P} audit_cost(i) 

5. The profitability of X is the average actual recovery per audit, i.e., the ratio 
between the total actual recovery and the number of audits suggested by X: 

• profitability(X) = actual_recovery(X) / #P 

6. The relevance of X relates profitability (a domain-dependent metric) and 
misclassification rate (a domain-independent metric): 

• relevance(X) = 10 * profitability(X) / misclassification_rate(X) 
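
The six indicators can be computed together from per-tuple predictions, as in the following sketch. The function name, the dictionary layout and the four-tuple toy data are illustrative assumptions; the formulas follow the definitions above.

```python
def evaluate_classifier(predicted, actual, actual_recovery, audit_cost):
    """Compute the six quality indicators of a classifier X over the test-set.

    predicted, actual: lists of "positive"/"negative" (pred_X(i) and car(i)).
    actual_recovery, audit_cost: per-tuple amounts, in the same order.
    """
    pairs = list(zip(predicted, actual))
    tn = sum(p == "negative" and a == "negative" for p, a in pairs)
    fp = sum(p == "positive" and a == "negative" for p, a in pairs)
    fn = sum(p == "negative" and a == "positive" for p, a in pairs)
    tp = sum(p == "positive" and a == "positive" for p, a in pairs)
    # P = TP UNION FP: the tuples that X would send to audit.
    audited = [i for i, p in enumerate(predicted) if p == "positive"]
    rate = (fp + fn) * 100 / len(predicted)
    recovery = sum(actual_recovery[i] for i in audited)
    costs = sum(audit_cost[i] for i in audited)
    profitability = recovery / len(audited) if audited else 0.0
    relevance = 10 * profitability / rate if rate else float("inf")
    return {
        "confusion_matrix": {"TN": tn, "FP": fp, "FN": fn, "TP": tp},
        "misclassification_rate": rate,
        "actual_recovery": recovery,
        "audit_costs": costs,
        "profitability": profitability,
        "relevance": relevance,
    }

# Toy test-set of four tuples (amounts are made up).
result = evaluate_classifier(
    predicted=["positive", "positive", "negative", "negative"],
    actual=["positive", "negative", "positive", "negative"],
    actual_recovery=[5.0, -1.0, 2.0, -0.5],
    audit_cost=[1.0, 1.0, 1.0, 0.5],
)
```

Note that relevance is undefined for a classifier with no errors; the sketch returns infinity in that case, which is a choice of this example rather than something the text specifies.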

Classifier Construction. We considered two distinct approaches to classifier 
construction, each driven by a different policy in audit planning: on the one hand, we 
can aim at keeping FP as small as possible, in order to minimize wasteful costs. On 
the other hand, we can aim at keeping FN as small as possible, in order to maximize 
evasion recovery. The two policies are clearly conflicting: as FP shrinks, TP shrinks 
accordingly, while FN (and TN) inevitably grows; the situation is dual when FN 
shrinks. The first policy is preferable when resources for audits are limited, the second 
when resources are in principle unlimited. In practice, one needs to find an acceptable 
trade-off between the two conflicting policies, by balancing the level of actual 
recovery with the resources needed to achieve it. The classifier construction method is 
therefore presented highlighting the parameters that may be adequately tuned to reach 
the desired trade-off. 

Parameter tuning. The following is the list of the main tuning methods used in our 
case study, which we perceive as most relevant for the whole class of applications. 

• The pruning level: the absence of overfitting enabled us to use a low pruning level 
(less than 10%) in all experiments: the resulting trees are therefore at least 90% as 
large as the corresponding trees obtained without pruning. 

• The misclassification weights: these are constants that can be attached to the 
misclassification errors - FP and FN in our case. The tree construction algorithm 
uses the weights to minimize errors associated with greater weights, by modifying 
the probability of misclassification. Misclassification weights are the main 
tool to bias the tree to minimize either FP or FN errors. 

• The replication of the minority class in the training-set: typically, a classifier is 
biased towards the majority class in the training-set. In our case study the majority 
class is that with car = negative. Thus, another approach to minimize FN is to 
artificially replicate the positive tuples, up to achieving a balance between the two 
values of car. 

• The adaptive boosting: the idea is to build a sequence of classifiers, where classifier 
k is built starting from the errors of classifier k-1 [9]. The majority of votes cast 
by the different classifiers establishes the classification of a new tuple. Votes are 
weighted with respect to the accuracy of the classifiers. This technique yields an 
appreciable reduction of the misclassification rate. 
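
Of the tuning methods above, the replication of the minority class is the easiest to illustrate in isolation. The sketch below re-draws minority tuples with replacement until both classes have the same size; whether the case study replicated tuples randomly or systematically is not stated, so the random draw and the function name are assumptions.

```python
import random

def replicate_minority(training_set, label_of, seed=0):
    """Return a training set in which every class is as large as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for t in training_set:
        by_class.setdefault(label_of(t), []).append(t)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Re-draw tuples with replacement until the class reaches the target size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced

# Toy training set with an 82% / 18% class split, as (id, car) pairs.
toy = [(i, "negative") for i in range(82)] + [(i, "positive") for i in range(82, 100)]
balanced = replicate_minority(toy, label_of=lambda t: t[1])
```

Replication should be applied to the training-set only; the test-set keeps its original class proportions so that the indicators remain comparable with the Real case.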



2.5 Model Evaluation 

We now present two classifiers, as an illustration of the above construction 
techniques, and assess their quality and adequacy to the objectives. The former 
classifier follows the “minimize FP” policy, whereas the latter classifier follows the 
“minimize FN” policy. 

Classifier A. Experiment A simply uses the original training-set, and therefore we 
obtain a classifier biased towards the majority class of the training-set, 
i.e., the negative car. As a consequence, we enforce the “minimize FP” policy without 
using misclassification weights. To reduce errors, we employ 10-trees adaptive 
boosting. The confusion matrix of the obtained classifier is the following: 



                    classified negative    classified positive 
really negative     237                    11 
really positive     70                     48 



Classifier A prescribes 59 audits (11 of which wasteful), and exhibits the following 
quality indicators: 

- misclassification_rate(A) = 22% (81 errors) 
- actual_recovery(A) = 141.7 M Euro 
- audit_costs(A) = 4 M Euro 
- profitability(A) = 2.401 
- relevance(A) = 1.09 

Profitability of model A is remarkable: 141.7 M Euro are recovered with only 59 
audits, which implies an average of 2.401 M Euro per audit. In comparison with the 
Real case, A allows recovering 88% of the actual recovery of Real with 16% of the 
audits. 

Classifier B. Experiment B adopts the “minimize FN” policy, and tries to bias the 
classification towards the positive car. A training-set with replicated positive tuples is 
prepared, with a balanced proportion of the two classes, in conjunction with 
misclassification weights that make FN errors count three times as much as FP errors 
(i.e., weight of FP = 1 and weight of FN = 3). Adaptive boosting (3-trees) is also 
adopted. The confusion matrix of the obtained model B is the following: 

                    classified negative    classified positive 
really negative             150                     98 
really positive              28                     90 



Classifier B prescribes 188 audits (more than 50% of which are wasteful), and exhibits the 
following quality indicators: 

- misclassification_rate(B) = 34% (126 errors) 

- actual_recovery(B) = 165.2 M Euro 

- audit_costs(B) = 12.9 M Euro 

- profitability(B) = 0.878 

- relevance(B) = 0.25 

Combined classifiers. More sophisticated models can be constructed by suitably 
combining diverse classifiers together. For instance, predictions of two classifiers can 
be put in conjunction, by considering fraudulent the subjects classified as positive by 
both classifiers. Conversely, predictions can be put in disjunction, by considering 
fraudulent the subjects classified as positive by either classifier. The following are the 
indicators for the model A and B, which prescribes 58 audits: 

- actual_recovery(A and B) = 141.7 M Euro 

- audit_costs(A and B) = 3.9 M Euro 

- profitability(A and B) = 2.44 

and the indicators for the model A or B, which prescribes 189 audits: 

- actual_recovery(A or B)= 165.1 M Euro 

- audit_costs(A or B) = 13.0 M Euro 

- profitability(A or B) = 0.87 

Clearly, conjunction is another means to pursue the “minimize FP” policy, and 
conversely disjunction for the “minimize FN” policy. The first policy usually yields 
more profitable models, as the examples show. This form of combination may be 
iterated, e.g. combining the classifier A and B with another classifier C (a trade-off 
between A and B), obtaining a model A and B and C which prescribes 43 audits: 

- actual_recovery(A and B and C) = 56.0 M Euro 

- audit_costs(A and B and C) = 3.2 M Euro 

- profitability (A and B and C) = 1.3 

and a model ((A and B) or C) which prescribes 80 audits: 

- actual_recovery((A and B) or C)= 144.0 M Euro 

- audit_costs((A and B) or C) = 5.2 M Euro 

- profitability((A and B) or C) = 1.8 

Classifiers can be combined also by voting, and we have built classifiers where at 
least n classifiers out of a set of n+m decide the number of audits to plan. 
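
The three ways of combining classifiers discussed above (conjunction, disjunction and voting) can be sketched as follows. The helper names and the five-subject prediction vectors are hypothetical and serve only to show the mechanics; they are not data from the case study.

```python
def combine_and(a, b):
    """Audit a subject only if both classifiers predict positive."""
    return [x and y for x, y in zip(a, b)]


def combine_or(a, b):
    """Audit a subject if either classifier predicts positive."""
    return [x or y for x, y in zip(a, b)]


def combine_vote(predictions, n):
    """Audit a subject if at least n of the given classifiers predict positive.

    predictions: list of per-classifier prediction lists of equal length.
    """
    return [sum(votes) >= n for votes in zip(*predictions)]


# Hypothetical predictions (True = positive) for five subjects.
a = [True, False, True, False, False]
b = [True, True, False, False, True]
c = [True, True, True, False, False]

print(sum(combine_and(a, b)))              # audits prescribed by "A and B"
print(sum(combine_or(a, b)))               # audits prescribed by "A or B"
print(sum(combine_vote([a, b, c], n=2)))   # audits with at least 2 of 3 votes
```

Conjunction can only shrink the set of prescribed audits and disjunction can only enlarge it, which is why the former serves the “minimize FP” policy and the latter the “minimize FN” policy.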



3. Concluding Remarks 



The first consideration arising from the experience sketched in this paper concerns the 
complexity of the KDD process. While the objectives of the various phases of the 
KDD process are clear, little support is provided to reach such objectives. Two main 
issues are, and will remain in the near future, hot topics on the research agenda of the 
KDD community: 

• the first point is methodology: it is crucial to devise methods tailored to relevant 
classes of similar applications; 

• the second point is the need to identify the basic features of an integrated 
development environment, able to support the KDD process in all its phases. 

Our experience indicates that a suitable integration of deductive reasoning, such as that 
supported by logic database languages, and inductive reasoning, such as that supported 
by decision trees, provides a viable solution to many high-level problems in the 
selected class of applications. In particular, we found such integration useful in the 
phase of model evaluation, where the uniform representation in the same logic 
formalism of data for the analysis and the results of the analysis itself allows a high 
degree of expressiveness. In another paper [10], we show how the entire KDD process 
here identified can be conveniently formalized and realized by using a query language 
that integrates the capabilities of deductive rules with the inductive capabilities of 
classification. 

We conclude by mentioning that in our experiments we used C5.0 [11], the most 
recent available version of the decision tree algorithm that Quinlan has been evolving 
and refining for many years. 



References 

1. Fawcett, T., Provost, F., “Adaptive Fraud Detection”, Data Mining and Knowledge 
Discovery, Vol. 1, No. 1, pp. 291-316, (1997). 

2. Uthurusamy, R., “From Data Mining to Knowledge Discovery: Current Challenges and 
Future Directions”, in Knowledge Discovery in Databases, Piatetsky-Shapiro and Frawley 
(eds.), AAAI Press, Menlo Park, CA, (1991). 

3. Fawcett, T., Provost, F., “Robust Classification Systems for Imprecise Environment”, 
Proc. of the 15th Int. Conf. AAAI-98, (1998). 

4. Stolfo, S., Fan, D., Lee, W., Prodromidis, A., Chan, P., “Credit Card Fraud detection 
using Metalearning: Issues and Initial Results”, Working Notes AAAI-97, (1997). 

5. Tanzi, V., Shome, P., “A Primer on Tax Evasion”, in IMF Staff Papers, No 4, (1993). 

6. Berry, M., Linoff, G., Data Mining Techniques for Marketing, Sales and Customer 
Support, Wiley Computer Publishing, New York, USA (1997). 

7. Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J., Classification and regression 
trees, Belmont, CA, Wadsworth (1984). 

8. Indurkhya, N., Weiss, S. M., Predictive Data Mining: a practical guide, Morgan Kaufmann, 
San Francisco, CA, (1998). 

9. Freund, Y., “Boosting a Weak Learning Algorithm by Majority”, Information and 
Computation, 121(2), pp. 256-285, (1995). 

10. Bonchi, F., Giannotti, F., Mainetto, G., Pedreschi, D., “A classification-based 
methodology for planning auditing strategies in fraud detection”, accepted at KDD ’99. 

11. http://www.rulequest.com/ 




Analysis of Accuracy of Data Reduction Techniques 



Pedro Furtado and H. Madeira 



University of Coimbra 
Portugal 
pnf@dei.uc.pt 



Abstract. There is a growing interest in the analysis of data in warehouses. 
Data warehouses can be extremely large and typical queries frequently take too 
long to answer. Manageable and portable summaries return interactive response 
times in exploratory data analysis. Obtaining the best estimates for smaller 
response times and storage needs is the objective of simple data reduction 
techniques that usually produce coarse approximations. But because the user is 
exposed to the approximation returned, it is important to determine which 
queries would not be approximated satisfactorily, in which case either the base 
data is accessed (if available) or the user is warned. In this paper the accuracy 
of approximations is determined experimentally for simple data reduction 
algorithms and several data sets. We show that data cube density and 
distribution skew are important parameters and large range queries are 
approximated much more accurately than point or small range queries. We 
quantify this and other results that should be taken into consideration when 
incorporating the data reduction techniques into the design. 



1. Introduction 

Data warehouses integrate information from operational databases, legacy systems, 
worksheets or any external source, to be used for decision support. The data 
warehouse must have efficient exploration tools which, regardless of data size, 
give fast and reasonably approximate answers to users exploring the data interactively. 
Multidimensional models are commonly used for the interactive exploration of the data in Online 
Analytical Processing (OLAP). To build the data cube, facts and dimensions must be 
identified, as well as the data granularity. For example, the dimensions can be products, stores and 
time, with a granularity of days. The space needed is then calculated as follows: 

time span = three years 
# products = 100,000 products (of which only 20% are sold daily) 
# stores = 100 
no. of records in the fact table = 3 x 365 x 20,000 x 100 = 2.19 x 10^9 records 
average record size = 8 attributes x 4 bytes = 32 bytes 

These figures do not include indexes, materialized views and other fact tables. A 
complete data warehouse such as this one could reach 70 GBytes in size. Data 
reduction techniques can be applied to any data cube derived from the facts or 
materialized views to reduce parts of the multidimensional space and obtain fast 
response times for approximate answers. This in turn is very important for exploring 
the data. There are several alternative data reduction techniques, such as sampling [2, 
7, 8], singular value decomposition (SVD) [6], wavelets [10], histogram-based 
techniques such as MHIST [5], clustering algorithms such as BIRCH [14] and index 
trees. These techniques are summarized in [9]. Data analysis needs for approximate 
answers directly expose the user to the estimates obtained. Although the reduced data 
is frequently associated with a very coarse initial approximation of the data, accuracy 
is very important. Even the simplest histogram-based reduction techniques return very 
small errors for large range queries encompassing whole summary regions. But 
queries may not encompass whole regions and answers to smaller range queries are 
also important. Furthermore, the possible absence or slow access of base data stored 
in tertiary memory or the reduction of summary tables in which points represent 
aggregated values require higher accuracy. In any case the exploration tool must be 
able to determine which queries are inaccurate and either access the base data or warn 
the user. Typical estimation errors are determined experimentally in this paper. The 
input data set is reduced using alternative techniques and several classes of queries are 
issued to determine the average estimation error. The experiment involves different 
data distributions and characteristics such as skew, density and sparseness. The results 
obtained in these experiments are used to draw conclusions about the accuracy that can be expected 
from data reduction algorithms. The paper is organized as follows. In section 2 
alternative generic reduction strategies are discussed. Section 3 presents the data 
reduction strategies and section 4 the data sets used in the experiments. Section 5 
shows the point and range error results that were obtained from the experiments and 
the conclusions that can be drawn from such results. Section 6 concludes the paper. 
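
The fact table sizing given at the start of this introduction can be reproduced with a few lines of arithmetic. The function and parameter names below are illustrative only.

```python
def fact_table_size(years, products_sold_daily, stores,
                    attributes=8, bytes_per_attribute=4, days_per_year=365):
    """Return (number of records, size in bytes) of a daily-grain fact table."""
    records = years * days_per_year * products_sold_daily * stores
    return records, records * attributes * bytes_per_attribute


# 20% of 100,000 products are sold daily, in 100 stores, over three years.
records, size_bytes = fact_table_size(years=3, products_sold_daily=20_000, stores=100)
print(records)            # 2190000000 records
print(size_bytes / 1e9)   # 70.08 (GB, using 10^9 bytes per GB)
```

At 32 bytes per record the record count alone accounts for about 70 GB, before indexes, materialized views and other fact tables are added.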



2. Alternative Reduction Strategies 

In this section we address the strategies for histogram-based data reduction and the 
impact on storage of choosing a given strategy. 

Multidimensional data points are represented in relational OLAP (ROLAP) as 
tuple(a_1,...,a_n, v_1,...,v_m). The multidimensional view can be obtained from the tuples by 
using the dimension attributes a_i as axes and the values v_j as the data cube contents. 
The task of the reduction algorithm is to derive approximate values for sets of value 
attributes v_j in the data cube regions, reducing the data set size (for simplicity we will 
consider only one value attribute). The reduced data can be stored as a summary, 
loaded or maintained in memory for fast answers to queries from tools exploring the 
data. 

2.1 Classification of Reduction Techniques 

The types of data reduction techniques we evaluate divide the multidimensional 
space into regions and approximate each region by a summarized description. Fast 
querying and searching is obtained by accessing the summarized descriptions. A 
generic summary is a set of regions R([a_1^l, a_1^u],...,[a_n^l, a_n^u], coeff_1,...,coeff_k), with one 
interval per dimension, forming a histogram where a region is usually called a bucket or cell. We define some important 
properties of data reduction techniques regarding the resulting histograms: 

Reduction strategy - distinguishes algorithms producing fixed grids (fixed grid 
strategy - FG) or variable sized buckets (variable grid strategy - VG). 

Variable grid strategy adaptability - measures the degree to which the space 
partitioning strategy is able to adapt to the data distribution. 

Approximation function - used to determine the coefficients that approximate 
the data in each bucket. Those coefficients will be kept in the bucket. 

Approximation function adaptability - measures the degree to which the 
approximation function is able to adapt to the data distribution. 

The fixed grid strategy (FG) imposes a fixed grid upon a multidimensional space 
view of the input data and approximates each grid cell by the approximation function 
coefficients. The variable grid partitioning strategy (VG) determines the best 
bucket partitioning of the same multidimensional space. A generic bucket produced 
by the VG strategy is represented by the structure bucket(MBR(ll_point,ur_point), 
data), where MBR denotes the region minimum bounding rectangle. Buckets 
produced by the fixed grid strategy can be represented more compactly by either the 
structure bucket(bucket_ID, data), where bucket_ID is determined by a computation 
on the indices, or stored as multidimensional array cells as cell(data) where the cell 
position is also determined by computation on the indices. Variable grid strategies use 
alternative algorithms to partition the space into buckets dynamically. There are 
recent proposals for both fixed grid and variable grid algorithms. Regression is used 
in [1] and wavelets in [11]. These are fixed grid techniques that use the approximation 
function and occasionally outliers to obtain a higher adaptability. Mhist [5] is a 
variable grid technique. In this paper we consider mainly fixed grid techniques 
because, by storing only the reduced values (molap organization), a lot of space is 
saved in comparison with adaptable techniques (which must store the buckets as 
explained before) and reduced values are accessed by simple computation of an offset. 
The extra space can be traded for lower reduction rates to improve approximations. 

Data with smooth variations is usually easier to approximate by most algorithms 
and large peaks often disturb the approximation. For this reason the use of outliers is 
important in histogram-based techniques whenever there are strong “thin” peaks such 
as a point completely divergent from the normal trend. Nevertheless, outliers are very 
expensive in terms of storage space: they require the storage of both the point 
coordinates and the value and do not provide associative access (must be searched). 
An outlier is stored as Outlier(point, value). 
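
As an illustration of the Outlier(point, value) representation, the sketch below pulls the most extreme values out of a small set of points before the remainder is approximated. Ranking by distance from the mean is an assumption made for this example; the text does not specify how extremes are selected.

```python
def extract_outliers(points, k):
    """Split {coordinates: value} into (outliers, remainder).

    The k values farthest from the overall mean are returned as
    (point, value) pairs; the remaining points are returned as a dict.
    """
    mean = sum(points.values()) / len(points)
    ranked = sorted(points, key=lambda p: abs(points[p] - mean), reverse=True)
    outlier_keys = set(ranked[:k])
    outliers = [(p, points[p]) for p in ranked[:k]]
    remainder = {p: v for p, v in points.items() if p not in outlier_keys}
    return outliers, remainder


points = {(0, 0): 10, (0, 1): 12, (1, 0): 11, (1, 1): 500}
outliers, rest = extract_outliers(points, k=1)
print(outliers)                          # the thin peak at (1, 1)
print(sum(rest.values()) / len(rest))    # average of the smooth remainder
```

Each stored outlier costs the full coordinates plus the value, which is the storage penalty noted above.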

Summary tables (or materialized views) are frequently computed to speed-up query 
answering in data warehouses: group-by queries are issued to compute partial sums on 
alternative combinations of dimension attributes, building the summary or 
materialized view. It is possible to recover range values from partial sums using the 
independence assumption or linear regularization [3]. This way, aggregation can be 
used as a data reduction technique. 

2.2 Storage Cost Analysis of Fixed and Variable Grid Strategies 

Variable grid strategies must store bucket boundaries, while fixed grid strategies 
store only the approximation coefficients. This subsection compares the storage space 
for the two alternative strategies to quantify the overhead incurred by the VG strategy. 
The storage space (ss) occupied by both strategies is simply 

ss(fixed_grid) = sizeof(coeffs)    (molap organization) 

ss(variable_grid) = 2 x PRSZ + sizeof(coeffs)    (bucket tuples) 

given the following quantities: 

Point reduction factor PRF = (# buckets) / (# points)    (2) 

Storage reduction factor SSR = (space occupied by summary) / (initial space)    (3) 

Points representation size PRSZ = size of coordinates x no. of dimensions    (4) 

Figure 1 compares SSR against PRF for VG and FG strategies considering 2 to 10 
dimensions and coordinates with 2 to 4 bytes (PRSZ between 4 and 40). The data 
value size considered was 4 bytes. 





[Plots of SSR against PRF, one curve per PRSZ value: (a) Variable Grid Strategy; (b) Fixed Grid Strategy (stored as molap)] 

Fig. 1. Storage Reduction Factors for Fixed grid and Adaptable Strategies 



This figure quantifies the space overhead required to represent the buckets in 
adaptable strategies and, conversely, the space gains when fixed grid strategies are 
used. For instance, for a point reduction to 10% of an original five dimensional space 
(with each dimension represented by 2 bytes: PRSZ = 10), the reduced data occupies 
17% of the original data size for the adaptable strategies and 3% for the fixed grid 
approach. This shows that adaptable partitioning strategies incur a high storage space 
overhead in comparison to fixed grid strategies, which also offer faster computation. 
These are strong motivations for choosing fixed grid algorithms instead of variable 
grid ones, although the lack of partitioning adaptability must be compensated by 
approximation function adaptability, which requires more coefficients in each cell (e.g. 
wavelets). The trade-off is thus between bucket partitioning overhead and coefficient storage overhead. 
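
The 17% and 3% figures quoted above can be reproduced under one assumption that the text does not state explicitly: the initial space is taken to be the rolap representation, in which every point stores its coordinates (PRSZ bytes) plus one 4-byte value. The function name ssr and its parameters are illustrative.

```python
def ssr(prf, prsz, value_size=4, coeff_size=4, variable_grid=False):
    """Storage reduction factor for a given point reduction factor.

    prf:  buckets / points.
    prsz: bytes needed for the coordinates of one point.
    Assumes each original point occupies prsz + value_size bytes and each
    bucket stores coeff_size bytes, plus two corner points for a variable grid.
    """
    point_bytes = prsz + value_size
    bucket_bytes = coeff_size + (2 * prsz if variable_grid else 0)
    return prf * bucket_bytes / point_bytes


# Five dimensions of 2 bytes each (PRSZ = 10), point reduction to 10%.
print(round(100 * ssr(0.10, 10, variable_grid=True)))    # 17 (%)
print(round(100 * ssr(0.10, 10, variable_grid=False)))   # 3 (%)
```

The gap widens as PRSZ grows, because the two corner points of each variable grid bucket scale with the number and size of the coordinates while the fixed grid cell does not.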

3. Data Reduction Techniques 

The data reduction techniques work on a multidimensional view of the input data, 
which must be cubed by parts (by loading arrays from tables such as in [12] - in 
production systems - or issuing queries against the database that retrieve the regions 
successively) to be fed as input to the reduction algorithm. It is preferable to apply the 
reduction only to non-empty points (those appearing in the rolap table) because the 
approximation error will be much larger if zeroes are also approximated. Empty 
positions are indicated by heavily compressed zero bitmap cuboids. The data 
reduction algorithms used in the experiments are: 

Average (g) - this gold experiment simply returns the average value. If the error is large, 
the data set will be difficult to approximate. 

Outliers (ol) - This algorithm simply extracts extremes, storing them as outliers. It can be 
used with any other technique, smoothing peak data variations. Outliers are stored as 
outlier(point, value) pair in table tuples. 

Mhist (mhist) - This VG technique was proposed in [5] for selectivity estimation. It 
implements space partitioning by analyzing marginal frequencies. Buckets are stored in 
table tuples as bucket(bucket_ll,bucket_ur, avg). 

Fixed Grid (fgrid) - This FG technique divides the space into equal-sized regions and 
computes the average for each one. It is stored as a multidimensional array as cell[avg]. 

Regression (regr) - We used the implementation of linear regression described in Quasi-Cubes 
[1]. This FG technique approximates the values in a column or row bucket by a line 
y = m*x + b, described by the parameters m and b, and stores (m,b) in the bucket cell in 
a multidimensional array: cell[(m,b)]. 

Wavelets (wav) - The wavelets technique (FG) was proposed for selectivity estimation in 
[11] and for data cube reduction in [10]. Wavelets represent a function in terms of a coarse 
overall shape, plus details that range from coarse to fine. Wavelet coefficients had to be 
stored together with a location identifier because smaller coefficients were stripped from 
the array of coefficients: cell[set(locator,coeff)]. 
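
A minimal sketch of the fixed grid (fgrid) technique on a two-dimensional cube is given below. It assumes that empty positions are marked with None and are skipped when averaging, in line with the remark that only non-empty points should be approximated. The function names and the sample cube are illustrative.

```python
def fgrid_reduce(cube, block):
    """Reduce a 2-D cube (list of rows) to one average per block x block cell.

    Empty positions (None) are ignored; a cell without any value yields None.
    """
    rows, cols = len(cube), len(cube[0])
    reduced = []
    for r0 in range(0, rows, block):
        out_row = []
        for c0 in range(0, cols, block):
            values = [cube[r][c]
                      for r in range(r0, min(r0 + block, rows))
                      for c in range(c0, min(c0 + block, cols))
                      if cube[r][c] is not None]
            out_row.append(sum(values) / len(values) if values else None)
        reduced.append(out_row)
    return reduced


def fgrid_estimate(reduced, block, r, c):
    """Point estimate: the cell is located by simple index arithmetic."""
    return reduced[r // block][c // block]


cube = [
    [1, 3, 10, None],
    [5, 7, None, 30],
    [2, 2, 4, 4],
    [2, 2, 4, 4],
]
reduced = fgrid_reduce(cube, block=2)
print(reduced)                             # [[4.0, 20.0], [2.0, 4.0]]
print(fgrid_estimate(reduced, 2, 1, 3))    # 20.0
```

In a full implementation the zero bitmap mentioned above would be consulted first, so that an empty position is reported as empty instead of receiving the cell average.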



In Figure 2 we further characterize the techniques. 

Fig. 2. Characterization of Techniques 

4. Datasets 

We have tested synthetic and typical warehouse data sets. The synthetic data sets 
were random (completely random values), zipf [13], and clustered (n clusters 
following the normal distribution). Figure 3 shows typical shapes for those 
distributions. The zipf distribution (figure 3(a) and (b)) is said to be typical of many 
real data distributions in databases [13]. The skewed zipf distribution (figure 3(b)) 
contains high thin peaks which are often difficult to approximate by fixed grid 
techniques but can be extracted as outliers. Skewed clustered distributions (figure 
3(c)) also show some peaks, but those peaks are thick and therefore cannot be handled 
efficiently by outliers. Adaptable techniques handle these skews much better than 
fixed grid techniques because they adapt the bucket boundaries to the topography. 

(a) Quasi-uniform Zipf (b) Very Skewed Zipf (c) Very Skewed Clustered 




Fig. 3. Typical Synthetic Data Sets 



Typical warehouse data sets were taken from several data warehouses in [4], 
including the sales dataset (product x days x stores = revenue: 60 x 184 x 20) and 
summaries resulting from roll-up operations (e.g. days to weeks) (figure 4). 

Fig. 4. Typical Data Sets 



5. Experiments 



Two error measures were used: the point query error and the range query error. 
These error measures are plotted against storage space reduction. Query 
experiments were run for small ranges, with areas ranging from 1 to 10,000 points, 
and for large ranges. For each sub-category we issued 10,000 range-sum queries. 
We first discuss the results for clustered and zipf synthetic data sets which 
reveal important characteristics. Then we show the results for the SALES data set. 

5.1 Clustered Distributions 

We use four clustered data sets to show how skew and sparseness influence the 
results. 

Figure 5 shows the point error for these data sets. The x-axis represents the storage 
space reduction factor (the output data size as a percentage of the input data size) 
when the input data is represented either as a data cube (%DC) or using a rolap 
organization. The data cube is the most compressed representation for non-sparse data 
sets, because dimension attributes are implicit, while rolap is the most compressed 
organization for sparse data sets, because data cubes must represent empty points as zeroes. 
Among the reduced representations, only fgrid and regr are stored in small data cube 
representations, while wavelets must store a large number of coefficients in each cell, 
ol must store the points in tuples and mhist must store the buckets in tuples as well. 

(d) Perror on Cl_Skew_Sp (Sparseness 90%) 

Fig. 5. Perror for Clustered Data Sets 

Discussion of Results: Figure 5(a) shows a point error of 8% for the gold experiment 
and a random distribution. Summaries occupying less than 10% of the data cube do 
not achieve any significant reduction of estimation error. The techniques 
progressively become more effective as more storage space is allocated for the 
reduced histogram. With about 80% of the data cube (30% of the rolap input) the 
error is halved to 4%. Figure 5(b) shows an estimation error of 90% for the gold 
experiment on skewed cluster distributions. With about 20% of the data cube size (7% 
of rolap input size) fgrid and regr techniques reduce the error to 20%, wavelets to 
30% and mhist or outliers to 60%. With 80% of the data cube (30% of rolap) both 
wavelets and regression techniques reduce the average point error to 4%. These 
results suggest that very large reduction rates produce high estimation errors. 
Comparing (a), (b) and (c) we can see that skew is a source of approximation 
difficulty for very large reduction rates. The gold experiment (g) gives a clue to 
quantify the degree of this difficulty. The skew increases from the Random 
distribution to Cl_Skew and to Cl_Lskew and the point error for the gold experiment 
was 10%, 95% and 125% respectively. Wavelets are particularly successful for 
skewed distributions, but have difficulty approximating low skew distributions for 
very large reduction rates, because wavelet coefficients occupy a significant space and 
important coefficients should not be dropped (higher reduction rates are obtained by 
dropping more coefficients, starting by the least significant ones). The three 
techniques - wav, ol and mhist - would be superior to fgrid and regr considering the 
point reduction factor (PRF) but that superiority is often lost by considering the 
storage space reduction factor (SSRF) (see figure 5(b)). Still, for very skewed 
distributions the adaptability (coefficient or point adaptability in the case of wavelets 
or outliers respectively) allows these techniques to yield better results than fgrid or 
regr (see figure 5(c)). Even regr achieves better results than fgrid because it is slightly 
more adaptable. Figures 5(b) and (d) were generated similarly but with totally different 
sparseness (10% in (b) and 80% in (d)). In this case the same reduction to 40% of the 
data cube size corresponds to a completely different reduction if the rolap input is 
considered (15% in (b) against 67% in (d)). This is because the non-reduced data cube 
can become considerably larger than the rolap representation due to sparseness (non-existent 
points in the rolap organization must be represented in the data cube). The 
estimation error is small when compared against the non-reduced data cube size 
(because a large fraction of the data cube consists of zeroes, which are not included in the 
approximation) but, if compared against the rolap size, it is comparable to that in Figure 
5(b). This has two major implications. On the one hand, it is advantageous to compute reduced data 
cubes instead of normal data cubes when data sets are sparse, because the normal data 
cube wastes a large space with zeroes, while the reduced version is much more 
compact with a small error. On the other hand, in order to be accurate, these reduced 
data cubes occupy a space that corresponds to a large portion of the initial rolap space, 
both for dense and sparse data sets. 

Figure 6 shows the range query estimation errors vs query size for clustered data 
sets using a data cube reduced to (3.7% rolap / 10% DC) in (a), (b) and (c), and a 
reduction to (14.8% rolap / 40% DC ) in (d). The query size is the query range 
volume. 

(a) Rerror on Random (3.3% rolap / 10% DC)    (b) Rerror on Cl_Skew (3.3% rolap / 10% DC) 
(c) Rerror on Cl_Lskew (3.7% rolap / 10% DC)    (d) Rerror on Cl_Lskew_Sp (14.8% rolap / 40% DC) 

Fig. 6. Range Query Errors for Clustered Datasets and Reductions as Indicated 

Results in figure 6 show that large range queries return reasonably accurate results. 
This result is logical. For instance, fgrid buckets lying completely within the query 
range contribute with error 0 to the result because the bucket average value is used. 
For the random data set, ranges with areas above 500 points had an error below 0.4%. 
The results are not so good for skewed distributions (b) and for very skewed 
distributions (c). Figure 6(c) and (d) show that more adaptable techniques (mhist, ol, 
wav) are able to approximate very skewed distributions much more accurately than 
simpler techniques such as regr, fgrid or aggregation (these should use outliers to 
adapt to large skews). When a large portion of the data cube is kept (Figure 6(d)) and 
adaptable techniques are used, the approximation is more accurate. Very sparse data 
sets have a very small number of non-empty points per unit area. To achieve a real 
reduction of the rolap input in this case, buckets must be very large, and even large range 
queries frequently select only a small number of non-empty points, giving significant 
estimation errors. 
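
The observation that buckets lying completely within the query range contribute no error can be checked on a small one-dimensional example. The relative error used below, |estimate - actual| / |actual|, is an assumed measure chosen for illustration, since the error formulas of the paper are not reproduced in this text; the data values are likewise made up.

```python
def bucket_averages(data, bucket):
    """Fixed grid summary of a 1-D dense data set: one average per bucket."""
    return [sum(data[i:i + bucket]) / len(data[i:i + bucket])
            for i in range(0, len(data), bucket)]


def range_sum_estimate(averages, bucket, lo, hi):
    """Estimate sum(data[lo:hi]) using only the bucket averages."""
    return sum(averages[i // bucket] for i in range(lo, hi))


def relative_error(estimate, actual):
    return abs(estimate - actual) / abs(actual)


data = [1, 3, 5, 7, 2, 2, 4, 4]
avgs = bucket_averages(data, bucket=4)   # [4.0, 3.0]
for lo, hi in [(0, 8), (0, 4), (0, 2), (0, 1)]:
    est = range_sum_estimate(avgs, 4, lo, hi)
    print((lo, hi), relative_error(est, sum(data[lo:hi])))
```

Ranges aligned with bucket boundaries, (0, 8) and (0, 4), are answered exactly, while the small range (0, 2) and the point query (0, 1) show relative errors of 1.0 and 3.0 respectively.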

5.2 Zipf Distribution 

Next we show the point and range error results for a very skewed zipf data set. 

(a) Point Error for Zipf Skewed Data Set    (b) Range Error for Zipf Skewed Data Set (20%DC) 
Fig. 7. Point and Range Errors for Zipf Skewed Distribution 

Discussion of Results: This is a skewed distribution and the most adaptable 
techniques (wavelets, mhist and outliers) achieved the best results for point errors. Outliers 
achieved the best result overall because they eliminate the thin peaks of the zipf 
distribution. This means that outliers should be used together with other techniques 
for isolated points that are very distant from the approximation. 

5.3 Warehouse Data Sets 

We have chosen a normal (Sales) and an aggregated (Sales-aggreg) warehouse 
data set. Figure 8 shows results for Sales-aggreg, a very dense data set summarizing 
sales. 

(a) Point Error on Sales-aggreg    (b) Range Error on Sales-aggreg (20%DC) 

Fig. 8. Point and Range Errors for Sales-aggreg Data Set 

The dense Sales roll-up data set was difficult to approximate. Point errors are 
always large; regr was the best technique for large reduction rates and wav or outl for 
smaller reduction rates. This is because regr is more space efficient, while wav and outl 
are more adaptable to the data distribution but less space efficient. For range queries, mhist and regr 
achieved better results than the other techniques. 

The multidimensional view of rolap data is frequently very sparse. Most data sets 
presented before were dense but the Sales data set of Figure 9 is very sparse (97%). 

(a) Point Error    (b) Range Query Error (80% rolap / 12%DC) 

Fig. 9. Point and Range Errors for the Sales Data set 

In Figure 9(a) the estimation error is reduced to very small values with only 9% to 
15% of the data cube (wav or outl are able to reduce the error to an insignificant 
amount with just 9% of the data cube). But the remarks made concerning sparse data 
sets apply here as well: 15% of the data cube corresponds to 100% of the rolap input 
size! The estimation error is so small simply because no storage space reduction was 
achieved (compared to the rolap input). If the rolap data occupied 20 GB, the reduced 
data cube would occupy 20 GB as well. When compared against rolap size, the 
estimation error is comparable to the dense data set case. The range query results were 
also obtained for a small reduction rate considering the rolap data size (to 80%). For 
this size the range query errors are small using the wav or outl adaptable techniques. 



5.4 Experiment Conclusions 

The estimation error varies a lot with different data sets and distribution skews. 
Even the simplest techniques can obtain small estimation errors for range queries with 
large size (with a large number of non-empty points). But when querying typical 
sparse data sets, small ranges or points, the estimation error is frequently large. Even 
when a significant fraction of the input data size is allocated to the reduced summary, 
the error is still significant. For several data sets the estimation error does not decay 
exponentially as more space is allocated to the approximation. In the case of sparse 
data cubes, although they can be highly reduced, such reduction does not correspond 
to a large compression of the base (rolap) data. The tool doing analysis on the reduced 
summaries should rely on a minimum number of non-zero values to determine if a 
query can be answered with sufficient accuracy. This threshold could be for instance 
1000 non-empty values, but the actual value depends on the data reduction rate. 
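
The suggested rule, answering from the summary only when the query covers enough non-empty values, can be sketched as follows. The emptiness bitmap layout, the function names and the return labels are assumptions for illustration; the default threshold of 1000 is the example value mentioned above.

```python
def count_non_empty(bitmap, r_lo, r_hi, c_lo, c_hi):
    """Count non-empty positions in a 2-D range.

    bitmap[r][c] is True when a value exists at position (r, c).
    """
    return sum(bitmap[r][c]
               for r in range(r_lo, r_hi)
               for c in range(c_lo, c_hi))


def route_query(bitmap, r_lo, r_hi, c_lo, c_hi, threshold=1000):
    """Decide whether the reduced summary is expected to be accurate enough."""
    n = count_non_empty(bitmap, r_lo, r_hi, c_lo, c_hi)
    return "summary" if n >= threshold else "base_data_or_warn"


# Hypothetical 100 x 100 cube in which every fourth column is populated.
bitmap = [[c % 4 == 0 for c in range(100)] for _ in range(100)]
print(route_query(bitmap, 0, 100, 0, 100))   # 2500 non-empty points
print(route_query(bitmap, 0, 10, 0, 10))     # 30 non-empty points
```

The whole-cube query is routed to the summary, while the small range falls below the threshold and would be answered from the base data or flagged to the user.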

None of the techniques seems to be substantially better than the other ones for all 
data sets. Although more sophisticated techniques such as mhist or wavelets obtain a 
lower point reduction factor (prf), they incur in higher storage overhead (for storing 
the buckets and coefficients), such that the approximation is not much better than the 
one obtained using compact molap reduced data sets from simpler algorithms. Still, 
adaptability is important to approximate irregular and skewed data sets. It is possible 
to conclude that the best results can be achieved by using a fixed grid strategy with 
some adaptability, which can be obtained by using either strongly adaptable 
approximating functions such as wavelets, or outliers. 
6. Conclusions 

In this paper we have made an experimental evaluation of histogram-based data 
reduction techniques focusing on the approximation error for several classes of 
queries. The data reduction algorithms were classified according to important 
characteristics and those characteristics were compared through the experiments. Data 
sets were also analyzed to determine how the distribution, skew and sparseness are 
relevant to the approximation accuracy. We have derived some guidelines for data 
reduction tools. 
Abstract. The recent proliferation of data mining tools for the analysis 
of large volumes of data has paid little attention to individual privacy 
issues. Here, we introduce methods aimed at finding a balance between 
the individuals’ right to privacy and the data-miners’ need to find gen- 
eral patterns in huge volumes of detailed records. In particular, we focus 
on the data-mining task of classification with decision trees. We base our 
security-control mechanism on noise-addition techniques used in statis- 
tical databases because (1) the multidimensional matrix model of sta- 
tistical databases and the multidimensional cubes of On-Line Analytical 
Processing (OLAP) are essentially the same, and (2) noise-addition tech- 
niques are very robust. The main drawback of noise-addition techniques 
in the context of statistical databases is the low statistical quality of released 
statistics. We argue that in data mining the major requirement of a secu- 
rity control mechanism (in addition to protecting privacy) is not to ensure 
precise and bias-free statistics, but rather to preserve the high-level de- 
scriptions of knowledge constructed by artificial data mining tools. 



1 Introduction 

New data collection technologies automatically capture millions of transactions. 
Every phone-call, purchase at a super-market, visit to a Web page, use of a 
credit card, etc., can now be easily logged together with its associated attribute 
information. Knowledge discovery and data mining (KDDM) has emerged as 
the technology to overcome the information overload and facilitate analysis and 
understanding of massive volumes of data [2,3]. While KDDM technology is 
maturing rapidly into commercial products that incorporate advances from the 
fields of statistics, machine learning and databases, little emphasis has been 
placed on privacy issues [14-17]. The predominant applications of KDDM are 
marketing applications which have regarded identification of individual profiles 
and attributes as a central goal of the process. Meaningful patterns that lead 
to understanding generic behaviours constitute an invaluable resource for cor- 
porations that need to know their customers, not only to preserve them in an 
increasingly competitive market, but also to extend their commercial relation- 
ship in even more saturated markets. 

Naturally, the application of KDDM technology is extending to other domains 
where the issues of individual privacy are certainly very delicate [10]. Examples 
of such domains are the analysis of bank transactions for money laundering 
detection, income tax returns or insurance claims for fraud detection, medical 
records for the discovery of groups at risk, etc. All these domains demand the 
development of technology that provides a balance between the data miners’ 
need to find generic patterns and the individuals’ right to privacy. We introduce 
methods aimed at finding this balance. However, achieving this objective 
involves a trade-off in the context of statistical databases as well as in KDDM 
applications [6,7]. We argue that, in KDDM, the major requirement of security 
control mechanisms (in addition to protecting privacy) is not to ensure precise and 
bias-free statistics, but rather to preserve the high-level descriptions of knowledge 
constructed by KDDM tools. In particular, we focus on the data-mining 
task of induction with decision trees. 

We base our security-control mechanism on noise-addition techniques used 
in statistical databases. Section 2 presents a justification for such a choice. In 
Section 3, we concentrate on the use of decision trees as inductive tools for 
building classifiers for knowledge discovery. In Section 4, we propose an extension 
of a method presented in Section 2 and apply it to privacy protection in the 
context of KDDM. In Section 5, we support this proposal by empirical results 
in a case study, namely, the Wisconsin Breast Cancer Database [12] (or simply 
the WBC data). In Section 6, we provide concluding remarks regarding the 
effectiveness of the proposed security mechanism. 

2 Balancing disclosure 

There are three main reasons why the notions from statistical databases are 
relevant to our task. 

First, the tabular data model of statistical databases and the multidimensional 
cubes of OLAP are very similar [21]. It is not hard to show that the abstract 
model of statistical databases [8] is equivalent to the multidimensional matrix 
model, which is in turn essentially the same as the multidimensional cube in 
OLAP. A statistical database can be equivalently modelled by a d-dimensional 
matrix (table, cube) in the following way. Denote the attributes in the database 
by $A_1, A_2, \ldots, A_d$ (the number of attributes $d$ is often referred to as 
the degree of the database). For each attribute, order the values that actually 
exist in the database. If the attribute is numerical, this can simply be 
increasing order; for categorical attributes, find some natural order. Let 
$a_j^1, a_j^2, \ldots, a_j^{|A_j|}$ be the sequence of values of the attribute 
$A_j$. Construct a d-dimensional matrix $S$ of size 
$|A_1| \times |A_2| \times \ldots \times |A_d|$, so that its element 
$S_{r_1, r_2, \ldots, r_d}$ ($a_j^{r_j} \in \{a_j^1, a_j^2, \ldots, a_j^{|A_j|}\}$, 
$j \in \{1, \ldots, d\}$) represents the result of the following query: 

COUNT($A_1 = a_1^{r_1} \wedge A_2 = a_2^{r_2} \wedge \ldots \wedge A_d = a_d^{r_d}$) 

The multidimensional matrix model and the abstract model of statistical databases 
are equivalent in the sense that it is possible to transform one form into another 
without loss of information. This is not true for the tabular model, whose entries 
contain summary statistics, as it has suffered information loss during the process 
of aggregation [1]. 

* Data set retrieved from the University of California at Irvine. 
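
The matrix construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_count_matrix` and `matrix_to_records` and the toy records are hypothetical, indices are zero-based rather than one-based, and "natural order" is taken to be Python's default sort order.

```python
from collections import Counter
from itertools import product


def build_count_matrix(records):
    """Return (domains, matrix) for a list of equal-length attribute tuples.

    domains[j] is the ordered list of values that actually occur for
    attribute j; matrix maps every index tuple (r_1, ..., r_d) to the COUNT
    of records whose j-th attribute equals domains[j][r_j].
    """
    d = len(records[0])
    domains = [sorted({rec[j] for rec in records}) for j in range(d)]
    position = [{value: r for r, value in enumerate(dom)} for dom in domains]
    counts = Counter(
        tuple(position[j][rec[j]] for j in range(d)) for rec in records
    )
    all_cells = product(*(range(len(dom)) for dom in domains))
    matrix = {cell: counts.get(cell, 0) for cell in all_cells}
    return domains, matrix


def matrix_to_records(domains, matrix):
    """Inverse transformation: rebuild the records (up to order) from the matrix."""
    records = []
    for cell, count in matrix.items():
        values = tuple(domains[j][r] for j, r in enumerate(cell))
        records.extend([values] * count)
    return records


# Hypothetical two-attribute database (city, blood count).
records = [("Sydney", 1), ("Sydney", 1), ("Perth", 2)]
domains, matrix = build_count_matrix(records)
rebuilt = matrix_to_records(domains, matrix)
```

Because every cell stores a full-order COUNT, the records can be rebuilt from the matrix (ignoring record order), which is the sense in which the two models are equivalent.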

Second, statistical databases have made progress in protecting individual 
values while allowing general statistics and patterns to be produced. In a 
statistical database some attributes are of a confidential nature (e.g., HIV diagnosis). 
Ideally, it should be impossible for a statistical user to deduce any significant 
information about individual values of these attributes. However, it should be 
said that any statistic involving the confidential attribute reveals ‘some’ 
information about its individual values. It is a matter of security policy to define 
what is to be understood by ‘significant’ information. For example, the statistic 
SUM(City=Sydney; Blood_count)=11.8 reveals that no patient from Sydney 
has a blood count over 11.8; this information is clearly insignificant. 

If disclosure of a confidential value occurs, the database is said to be 
compromised. A positive compromise occurs if a user discloses the exact value of 
a confidential attribute, and a partial compromise occurs if a user is able to 
obtain substantial information about a confidential value, without disclosing it 
exactly [1]. Partial compromise includes: negative compromise, that is, disclosing 
the fact that, for a particular individual, an attribute does not have a certain 
value, or its value does not lie within a certain range; approximate compromise, 
i.e., revealing that a confidential value lies in a given range; probabilistic 
compromise, where a confidential value is disclosed with a certain probability; relative 
compromise [13], where the relative order of magnitude of two or more confidential 
values is revealed. Security-control mechanisms in a statistical database 
cannot both (1) provide statistical users with a sufficient amount of high-quality statistics 
(statistical quality is measured by consistency, bias and precision) and, at 
the same time, (2) prevent exact and partial disclosure of confidential individual 
information. Thus, various techniques have been proposed for balancing these 
objectives, but none of them is both effective and efficient. The techniques can 
be classified into query restriction and noise addition. Query restriction includes 
Query size control, Query set overlap control, Maximum order control, 
Partitioning, and Cell suppression. On the other hand, noise addition includes Output 
perturbation, Random sample techniques, Data Perturbation, and Probability 
Distribution Data Perturbation. 

The third reason why we refer to statistical databases is that recent 
suggestions for privacy protection in the context of KDDM [6,7,17] map directly to 
the framework of methods in statistical databases. 

We concentrate on Probability distribution data perturbation methods [1,11]. 
These replace the original database with a new one that has the same probability 
distribution. We shall describe the so-called data swapping technique, which is 
particularly suitable for privacy protection in knowledge discovery. Data swapping 
interchanges the values in the records of the database in such a way that 
low-order statistics are preserved [8]. We recall that $k$-order statistics are those 
that employ exactly $k$ attributes. A database $D$ is said to be $K$-transformable if 
there exists a database $D'$ that has no records in common with $D$, but has the 
same $k$-order COUNTs as $D$, for $k \in \{0, \ldots, K\}$. However, finding a general data 
swap is an intractable problem [8]. Approximate data swapping [20] replaces the 
original database (or a portion of it) with randomly generated records, so that 
the new database has similar $K$-order statistics to the original one. In the context 
of statistical databases, the drawbacks of this method are high implementation 
cost, unsuitability for on-line dynamic databases and low precision. We shall 
apply Data Swapping in Section 4 for privacy in a KDDM context. 
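
To make the definition of transformability concrete, the following Python sketch compares the low-order COUNTs of two small databases. The helper names `order_counts` and `same_counts_up_to` and the two toy databases are hypothetical; the code only checks the definition and makes no attempt to find a swap, which the text notes is intractable in general.

```python
from collections import Counter
from itertools import combinations


def order_counts(records, k):
    """All k-order COUNTs: one count per choice of k attributes and their values."""
    d = len(records[0]) if records else 0
    counts = Counter()
    for rec in records:
        for attrs in combinations(range(d), k):
            counts[(attrs, tuple(rec[j] for j in attrs))] += 1
    return counts


def same_counts_up_to(db, other, max_order):
    """True if both databases agree on every k-order COUNT for k = 0..max_order."""
    return all(
        order_counts(db, k) == order_counts(other, k)
        for k in range(max_order + 1)
    )


# Toy example: the second attribute's values are swapped between the two records.
D = [(0, 0), (1, 1)]
D_swapped = [(0, 1), (1, 0)]
no_common_records = set(D).isdisjoint(D_swapped)
agree_to_order_1 = same_counts_up_to(D, D_swapped, 1)
agree_to_order_2 = same_counts_up_to(D, D_swapped, 2)
```

In this toy case the two databases share no record and agree on all 0-order and 1-order COUNTs, but they differ on the 2-order COUNTs, so the swap preserves only the low-order statistics.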

3 Inducing Classifiers 

One of the most common KDDM tasks consists of building classifiers [2]. This 
task takes as input a set of classes and a training set consisting of pre-classified 
cases, and builds a model of some kind that can be applied to unclassified data 
in order to assign it a class. More precisely, we can think of a classifier as a 
function $f : A_1 \times \ldots \times A_d \rightarrow C$, and of an $n$-record training set as a collection of 
$n$ points $f(x^i) = c_i$, $i = 1, \ldots, n$, for the function $f$. The training set is typically 
presented as cases in attribute-vector format; that is, each case is a row in a 
table and the $i$-th case is a vector $x^i = (id, x^i_1, \ldots, x^i_d, c_i)$. The first entry $id$ is 
an identifier that uniquely identifies this case. The $j$-th attribute of the $i$-th case 
has value $x^i_j \in A_j$, $i = 1, \ldots, n$ and $j = 1, \ldots, d$. Finally, $c_i \in C$ is the class for 
the $i$-th case. The goal of the model is not only to have high predictive power 
on unseen (future) cases, but also, especially for knowledge discovery, to describe 
how the class depends on the attributes. That is, for knowledge discovery, the 
computer tools should provide some understanding of the data. 

We shall consider decision trees where an internal node, labelled with the 
categorical attribute $A_t$, has $|A_t|$ edges connecting the node to its $|A_t|$ children, 
each edge labelled with one of the values in the set $A_t$. The label $a_j$ of the edge 
indicates that this child (usually the $j$-th for categorical attributes) is to be tried 
next when classifying a record, if and only if the record has the value $a_j$ as its 
$t$-th attribute (that is, $x_t = a_j$). If the $t$-th attribute is numerical (also referred 
to as continuous), then typically the internal node has only two outgoing edges (and 
thus only two children), associated with the intervals $x_t < b$ and $x_t > b$, for some 
bound $b$. The edge labelled $x_t < b$ is selected if and only if the $t$-th attribute of 
the record is less than the bound $b$. 
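
As a rough illustration of how a record travels down such a tree, here is a minimal Python sketch. The nested-dictionary representation, the key names (`attribute`, `children`, `bound`, `below`, `at_or_above`) and the toy tree are all assumptions made for this example. Sending a value equal to the bound to the second child is also an assumption, since the text above gives only the strict inequalities.

```python
def classify(node, record):
    """Follow edges from the root until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        if "bound" in node:
            # Numerical attribute: exactly two children, split at the bound.
            node = node["below"] if value < node["bound"] else node["at_or_above"]
        else:
            # Categorical attribute: one child per attribute value.
            node = node["children"][value]
    return node


# Hypothetical tree: a categorical test at the root, a numerical test below it.
tree = {
    "attribute": "colour",
    "children": {
        "red": "class A",
        "blue": {
            "attribute": "size",
            "bound": 3,
            "below": "class A",
            "at_or_above": "class B",
        },
    },
}

label = classify(tree, {"colour": "blue", "size": 2})
```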

Fig. 1 shows the decision tree built by the latest version of Quinlan’s famous 
decision tree builder, now named C5 (the earlier versions are C4.5 [19] and 
ID3 [18]). The training set consists of the first 200 cases of the WBC data. The tree 
indicates that the first attribute to be tested is attribute $A_2$ and, if this is greater 
than 1, then attribute $A_3$ should be inspected next. If the case exhibits a value 
greater than 2, the case is to be classified as malignant. A total of 88 training cases 
end up at this leaf, but although the leaf classifies them as malignant, 7 training 
cases arriving here are benign. 

[Fig. 1: the tree diagram and rule listing did not survive text extraction. The 
recoverable fragments are three rule covers with their confidence factors, 
96 [0.990], 78 [0.938] and 88 [0.911], and the default rule DEFAULT ⇒ benign.] 

Fig. 1 - Decision tree and rules constructed by C5 with 200 cases. Attribute $A_2$ is 
‘Uniformity of Cell Size’, $A_3$ is ‘Uniformity of Cell Shape’ and $A_5$ is ‘Single Epithelial 
Cell Size’. 

There are many variants of decision trees in the statistical [4] and the machine 
learning literature [19,22]. Decision trees are interesting because the classifiers 
can be associated with logic rules. That is, decision trees give an insight into 
dependencies between attribute values and classes and allow the production 
of explicit knowledge in a form amenable to human understanding; they also 
facilitate construction of SQL expressions for efficient record retrieval from large 
databases. This is harder for statistical techniques like linear discriminants or 
connectionist approaches like neural networks that encode the model in learned 
real-valued coefficients and numerical forms. For example, for the tree in Fig. 1, 
C5 produces the rules presented in Fig. 1. In order to classify a case, all ^ rules 
whose antecedent matches the case are selected and a vote is taken according to 
the confidence factors. If no rule applies, the default rule is used. The rules also 
indicate the number of cases in the training set that match their antecedent. 
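
The rule-based classification step can be sketched as follows. This is a simplified illustration, not C5's actual procedure: summing the confidence factors per class is an assumption about how the vote is taken, and the rules, attribute names and confidence values below are hypothetical.

```python
from collections import defaultdict


def classify_by_rules(rules, default, case):
    """Vote among all rules whose antecedent matches the case.

    rules is a list of (antecedent, label, confidence) triples, where
    antecedent is a predicate on the case. Each matching rule adds its
    confidence to its class; the default class is used if no rule matches.
    """
    votes = defaultdict(float)
    for antecedent, label, confidence in rules:
        if antecedent(case):
            votes[label] += confidence
    if not votes:
        return default
    return max(votes, key=votes.get)


# Hypothetical rule set, for illustration only.
rules = [
    (lambda c: c["size"] > 4, "malignant", 0.90),
    (lambda c: c["shape"] <= 2, "benign", 0.60),
    (lambda c: c["size"] <= 1, "benign", 0.95),
]

prediction = classify_by_rules(rules, "benign", {"size": 5, "shape": 1})
```

In this example two rules match the case; the malignant rule wins because its confidence (0.90) exceeds the total for benign (0.60).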

Decision trees and their logic rules help understand the data. First, they 
indicate which are the attributes that have most influence in determining the 
class. In Fig. 1, not all 9 attributes are used but only 3. Also, the rules express 
patterns. In this example, large values in the uniformity of cell size indicate a 
malignant diagnosis, while small values in the uniformity of cell size are a strong 
indication of a benign diagnosis. 

4 Our Proposal 

We now propose algorithms to ensure confidentiality, up to partial disclosure, 
via noise addition. We use noise addition to construct a new training set which is 
released to the miner. The new training set is a perturbed version of the original 
training set, so that the data miner may have access to individual records but 
will be uncertain whether the given class is accurate. We release original values, 
but the original assignment of cases to classes is never made public. However, 
the miner should be able to obtain general patterns in the new training set that 
are a reflection of patterns in the original training set. Note that we take the 
approach that KDDM is an exploratory, rather than confirmatory, approach to 
data understanding. As opposed to statistical data analysis, the miner does not 
aim at obtaining a definite, unbiased statistical test that answers with a 
probabilistic degree of confidence if the data fits a preconceived statistical model. 
More simply, KDDM is not about hypothesis testing but about generation of 
plausible hypotheses. Of course, the hypotheses generated by the miner will 
require formal statistical analysis to ensure their validity and significance. KDDM 
should provide an efficient filter in the vast universe of hypotheses. 

^ Note that in general there may be more than one applicable rule (for example, 
because the case is missing the value for some tested attribute). 

We construct a new training set by building decision trees. Intuitively, the 
process of building a tree expands a node by the most informative attribute 
that splits its cases, maximising homogeneity of the class labels in the children. 
This greedy approach tends to overfit the data; thus, typically, the process 
is halted although leaves may not be homogeneous. We first observe that if we 
randomly permute the class labels within the cases of each heterogeneous leaf, 
the new tree would classify training cases and new cases exactly as the original tree does. For 
example, in the tree of Fig. 1, the second leaf is heterogeneous with 7 cases with 
the label benign and 81 cases with the label malignant. So if we strip the label 
from the attribute vectors, randomly permute the 88 class labels and reattach 
them to the cases, we still have a node with a majority of malignant cases. All 
training cases will follow the same path and will be classified the same as in the 
original tree. At this stage the tree has not changed. 
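
The label permutation just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function `swap_labels_within_leaves`, the `leaf_of` callback (standing in for the tree built on the original data) and the toy data are hypothetical, and the toy data only mimics the 7 benign / 81 malignant leaf mentioned above, not the real WBC cases.

```python
import random
from collections import defaultdict


def swap_labels_within_leaves(cases, leaf_of, rng=None):
    """Return a new training set with labels shuffled inside heterogeneous leaves.

    cases is a list of (attributes, label) pairs; leaf_of maps an attribute
    vector to the identifier of the leaf it reaches in the original tree.
    Attribute vectors are released unchanged; only labels move.
    """
    rng = rng or random.Random()
    by_leaf = defaultdict(list)
    for i, (attrs, _) in enumerate(cases):
        by_leaf[leaf_of(attrs)].append(i)

    new_labels = [label for _, label in cases]
    for indices in by_leaf.values():
        labels = [cases[i][1] for i in indices]
        if len(set(labels)) > 1:  # only heterogeneous leaves are perturbed
            rng.shuffle(labels)
            for i, label in zip(indices, labels):
                new_labels[i] = label
    return [(attrs, label) for (attrs, _), label in zip(cases, new_labels)]


# Toy data: one heterogeneous leaf (81 malignant, 7 benign), one homogeneous leaf.
cases = (
    [((5,), "malignant")] * 81
    + [((5,), "benign")] * 7
    + [((1,), "benign")] * 112
)


def leaf_of(attrs):
    """Hypothetical one-split tree on the single attribute."""
    return "high" if attrs[0] > 1 else "low"


released = swap_labels_within_leaves(cases, leaf_of, random.Random(0))
```

Each leaf keeps the same number of labels of each class, so the majority class of every leaf, and hence the classification made by the original tree, is unchanged.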

A few things would be different. The tree built on the 200 training cases (with 
the original class labels) makes 7 mistakes out of 200 (they are false positives, 
that is, the tree labels 7 cases as malignant while they are in fact benign). This is 
an error rate of 3.5% in the training set. However, the same decision tree (with 
the labels permuted as suggested) will exhibit an error rate between 3.5% and 
7% in the new training set. Thus, if the miner ever builds the tree of Fig. 1, the 
miner will potentially observe a larger error rate. 

However, our proposal is that we need to trade off accuracy in classification 
for confidentiality. Namely, the only way we can guarantee that the snooper never 
finds a case x where the class f(x) is known with certainty (total disclosure) is 
if the models built from the data have less predictive accuracy in classification. 
Note that we expect miners not to use the models they build for classification, 
but for data understanding and for discovering trends and patterns. Thus, the 
real question is, how different are the rules generated and the patterns observed 
if different induction algorithms are used on the new training set? For example, 
what does C5 build when it only sees the new training set of 200 cases for 
the WBC data? Note that at least 186 are the same original cases (at most 7 
labels get swapped, creating at most 14 attribute vectors where the new label is 
different from the original one). But because we do not release the tree of Fig. 1, 
the miner does not know which cases have a swapped label and cannot have 100% 
certainty that any case in the new training set has its original class attached to 
it. Since at least 186 cases are the same, and potentially on average even more, 
the cases in the new training set have a high probability of having their original 
class label. Here again the trade-off is apparent. 
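
The count of 186 and the remark about the average can be checked with a short calculation. It assumes, as in the example above, a single heterogeneous leaf with $n = 88$ cases of which $m = 7$ are benign, and a uniformly random permutation of the 88 labels. Under that assumption each originally benign case receives a malignant label with probability $(n-m)/n$, and each originally malignant case receives a benign label with probability $m/n$.

```latex
\begin{align*}
\text{changed labels} &\le 2m = 14,
  \qquad \text{unchanged cases} \ge 200 - 14 = 186, \\
E[\text{changed labels}] &= m\,\frac{n-m}{n} + (n-m)\,\frac{m}{n}
  = \frac{2\,m\,(n-m)}{n}
  = \frac{2 \cdot 7 \cdot 81}{88} \approx 12.9, \\
E[\text{unchanged cases}] &\approx 200 - 12.9 = 187.1 .
\end{align*}
```

So, under these assumptions, about 187 of the 200 released cases carry their original label on average, slightly more than the guaranteed minimum of 186.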
5 Case study 

Our experiment aims to determine whether the induction of rules by C5 or other tools 
on the perturbed training set can still find the general patterns in the original 
data. We use the full WBC Database that holds 699 records. The miner does not 
see the original tree. Because the swapping is random, we repeat the generation 
of the new training set 10 times. For reasons that we explain later, we originally 
expected that trees built out of the perturbed training sets would be larger than 
the original tree, and also that there would be more specific rules. However, this 
effect is very much corrected when the data sets are larger. While the tree built 
by C5 with the original 699 cases has 16 leaves and produces a rule set of 8 
rules and a default rule, the average tree size produced from the ten perturbed 
training sets is 13 (±2.3 with 95% confidence). Thus, it is reasonable to expect that 
tree size is preserved by our methods as the data sets get larger. The rule set 
was slightly larger; the average was 9 (±1.2 with 95% confidence). We should 
point out that 7 out of 16 leaves in the original tree have cases from both classes 
and the maximum data swap is 22 cases. 

For important knowledge discovery aspects, like the most relevant attributes, 
we note that 9 out of 10 trees built from the independent perturbed training 
sets used ‘Uniformity of Cell Size’ at the root, thus ranking it as the most 
informative attribute for classification. Typically, nodes one level deep preserved 
the attribute selected; however, ‘Uniformity of Cell Shape’ was pushed further 
down or not used in 30% of the new trees. 

We say that an attribute is identified as relevant if it appears in the tree, 
and thus, in the rules. With respect to identification of relevant attributes, the 
original tree used 7 of the 10 possible attributes. Six of these 7 attributes were 
identified as relevant most of the time. In fact, in all of the 10 new decision 
trees, 4 of these seven attributes were identified as relevant. One attribute was 
identified 9 times; in only one new tree was it not used. We already mentioned 
that the attribute ‘Uniformity of Cell Shape’ was dropped 3 out of 10 times. One 
attribute, ‘Single Epithelial Cell Size’, was used in only 2 of the ten new trees. 
However, in the original tree, this attribute is only used once and at depth 4. 
Also, the attribute that was dropped 3 times was used only once in the tree (note 
that numerical attributes can be used more than once by further divisions of the 
domain). Thus, all attributes that were used twice or more in the original tree, 
and are thus even more relevant, are preserved by our swapping approach. 

Now we analyse the rules. Table 1 shows the 4 rules that appeared exactly or 
in very similar format in the rule sets generated from the 10 perturbed data sets. 
A rule was classified as very similar if the same attributes were tested and the 
bounds tested were no more than 1 away, or if the negated rule was generated 
and was similar. The other 4 original rules were also generated, but not out of 
each of the 10 perturbed training sets, perhaps indicating less general patterns. 
Table 1 also shows rules that emerged in at least 70% of the new sets of rules. 
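
The similarity criterion for rules can be made concrete with a small Python sketch. The rule representation and the function `very_similar` are hypothetical. Requiring the same comparison operator and the same class label is an added assumption, and the "negated rule" case mentioned above is not handled here. The first rule below follows Table 1 as extracted; the other two are invented for illustration.

```python
def very_similar(rule_a, rule_b, tolerance=1):
    """Compare two rules of the form (tests, label).

    tests maps an attribute name to an (operator, bound) pair. Two rules are
    treated as very similar if they test the same attributes with the same
    operators, predict the same class, and each pair of bounds differs by at
    most `tolerance`.
    """
    tests_a, label_a = rule_a
    tests_b, label_b = rule_b
    if label_a != label_b or tests_a.keys() != tests_b.keys():
        return False
    return all(
        tests_a[attr][0] == tests_b[attr][0]
        and abs(tests_a[attr][1] - tests_b[attr][1]) <= tolerance
        for attr in tests_a
    )


original_rule = (
    {"Uniformity of Cell Size": ("<", 2), "Clump Thickness": ("<", 3)},
    "benign",
)
# Hypothetical rules induced from a perturbed training set.
close_rule = (
    {"Uniformity of Cell Size": ("<", 3), "Clump Thickness": ("<", 3)},
    "benign",
)
distant_rule = (
    {"Uniformity of Cell Size": ("<", 5), "Clump Thickness": ("<", 3)},
    "benign",
)

is_close = very_similar(original_rule, close_rule)
is_distant = very_similar(original_rule, distant_rule)
```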

Table 1. List of rules preserved by learning from perturbed training sets. 

Fully preserved rules 
  Uniformity of Cell Size < 2 ∧ Clump Thickness < 3 ⇒ benign 
      (cover 271) [0.996] 
  Uniformity of Cell Size < 4 ∧ Marginal Adhesion < 3 ∧ Bare Nuclei < 2 ⇒ benign 
      (cover 406) [0.990] 
  Clump Thickness > 3 ∧ Bare Nuclei > 3 ∧ Bland Chromatin > 2 ⇒ malignant 
      (cover 186) [0.952] 
  Clump Thickness > 5 ∧ Uniformity of Cell Size > 2 ⇒ malignant 
      (cover 165) [0.946] 

Preserved rules in 70% of perturbations 
  Uniformity of Cell Size > 2 ∧ Uniformity of Cell Shape > 2 ∧ Bare Nuclei > 2 ⇒ malignant 
      (cover 211) [0.948] 

Thus, even in these smaller sets, C5 was able to recover the pattern that 
small values in the ‘Uniformity of Cell Size’ and the ‘Single Epithelial Cell 
Size’ are a strong indication of a benign diagnosis, while large values of these attributes 
indicate a malignant diagnosis. We consider these results remarkable given the fact 
that decision tree induction is known to be very brittle [18]. That is, slightly 
perturbed training sets, or noise in the data, produce very different decision 
trees and sets of rules. 

The second question we investigated is what happens if the miner does not 
use C5 to analyse the published data set, but some other method to induce 
logic rules. First, we used CN2 [5] to induce logic rules from the original Breast 
Cancer data set and our 10 perturbed data sets. Space is insufficient to present the 
detailed comparison. Nevertheless, we are very pleased to observe that mining 
from both (the perturbed data sets as well as the original data set) resulted in the 
identification of ‘Clump Thickness’ and ‘Uniformity of Cell Size’ as very relevant 
attributes (always tested first or second). Also, mining from the original or from 
the perturbed data sets resulted in ‘Single Epithelial Cell Size’, ‘Large Bare Nuclei’ 
and ‘Marginal Adhesion’ as relevant. There was also agreement in the cut-off 
values where these attributes split the benign and malignant classes. We 
were impressed that CN2 labelled ‘Clump Thickness’ as very relevant, above 
‘Uniformity of Cell Size’, while C5 never placed this attribute at the root of its 
trees. Thus, we see these results as a virtue of our security mechanism. It does 
not obscure the preferences (biases) of particular inductive methods. 
What CN2 finds most relevant is preserved in the perturbed training sets. 

However, CN2 is more brittle to noise: while the number of rules generated 
from the original data was 19, the average number of rules among the 10 perturbed 
data sets was 31 (with a 95% confidence interval of ±2). We see this as a problem 
of CN2, and note that decision trees are also affected by noise [14]. 

Next, we used EVOPROL [9], a genetic programming tool for inducing 
classification rules. Again, the rules and patterns discovered in the perturbed sets 
were analogous and parallel to those obtained in the original set. Once again, 
this method considered ‘Clump Thickness’ more relevant than ‘Uniformity of Cell 
Size’ in both (original and perturbed data sets), using it more often and earlier 
in the rules. The impact of noise on the length of classifiers was much less 
evident. Thus, we see that the miner is free to choose the induction mechanism and 
obtains results very similar to those it would have obtained from the original data set. 
6 Discussion 

Our experiment shows the merits of our proposed method for ensuring partial 
disclosure while allowing a miner to explore detailed data. Bearing in mind that 
data miners are primarily interested in general patterns, and not so much in 
obtaining unbiased statistical indicators, we have proposed a low-cost security 
control mechanism. Our data swapping consists of finding a K-transformation 
of a given database D, but we relax the condition that D and the new D' 
have no records in common. The swapping is performed over the confidential 
attribute only (in this article, the confidential attribute is the class label C), as 
a random shuffling within heterogeneous leaves of the decision tree. Clearly, all 
the statistics which do not involve this attribute will be preserved. Similarly, the 
statistics that involve the confidential attribute and whose query sets are defined 
by internal nodes or homogeneous leaves of a decision tree will also be preserved. 
Since the heterogeneous leaves have a vast majority of records belonging to a 
single class and no straightforward way for further splitting, we can argue that 
the most seriously disturbed statistics will be those that involve a small number 
of records, and have no obvious impact on the function f that is to be learned. 
Furthermore, we can balance the statistical precision against the security level 
by choosing to perform the swapping in the internal nodes, rather than in the 
leaves of the decision tree: the closer to the root, the higher the security but 
the lower the precision. 

Finally, we would like to comment that data size also plays a role. For 
example, if we use the first 200 cases of the breast-cancer database, our method leads 
to new training sets that ensure partial disclosure, but the decision trees built 
by the miner are larger (8 leaves and 7 rules seem to be the expected values). 
This seems to be due to several aspects. First, C5 is trying to maximise expected 
accuracy on new cases while inducing from a perturbed training set. So, in learning 
from the new training set, it overfits such a small new training set and needs to 
produce larger trees and more rules. For example, a prototypical malignant case 
that would land on the second leaf of the tree in Fig. 1 may have its label changed 
to benign by our noise addition. So, if this case (with swapped label) appears in 
the new training set, C5 will require a very specific additional rule or a deeper 
path in the tree to carve it out when it is surrounded by many similar malignant 
cases. In fact, some rules generated by C5 from the new training set have a very 
small cover, indicating to the miner (snooper) that these are potentially outliers 
whose label our method has swapped. However, the miner would still have no 
absolute certainty. 